
arXiv: 2604.13268 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.CL · cs.IR

Indexing Multimodal Language Models for Large-scale Image Retrieval

Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.IR
keywords multimodal large language models · image retrieval · zero-shot re-ranking · similarity estimation · large-scale indexing · vision-language models

The pith

Multimodal large language models can act as zero-shot similarity estimators for large-scale image retrieval by scoring image pairs through next-token probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether multimodal large language models can estimate visual similarity between images without training or fine-tuning. It prompts the model with two images at once and turns the probability assigned to the next token into a similarity score, then applies this score to re-rank the top candidates from an initial retrieval system. This relies on the visual knowledge already acquired during the model's multimodal pre-training. Experiments on several benchmarks show that the approach outperforms task-specific re-rankers outside the domains those re-rankers were built for, and that it handles clutter, occlusion, and small objects more robustly than the alternatives. The work also reports failure cases when objects undergo severe appearance changes.

Core claim

Prompting an MLLM with a pair of images and converting the next-token probabilities into a similarity score produces an effective training-free re-ranker that, when combined with memory-efficient indexing, improves instance-level image retrieval performance across diverse benchmarks and remains robust to clutter, occlusion, and small objects.

What carries the argument

Prompting an MLLM with paired images and mapping next-token probabilities to similarity scores for zero-shot re-ranking of indexed candidates.
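As a rough illustration of the scoring step, the sketch below turns next-token logits into a similarity score by reading off the probability of an affirmative answer token. This summary does not give the paper's exact formula, so the function name `affirmative_score`, the token indices, and the choice between a full-vocabulary softmax and a two-way yes/no softmax are all assumptions made for illustration.

```python
import math

def affirmative_score(logits, yes_id, no_id=None):
    """Map next-token logits to a similarity score in [0, 1].

    logits: raw scores over the vocabulary at the next-token position.
    yes_id / no_id: hypothetical vocabulary indices of the affirmative and
    negative answer tokens (assumptions, not taken from the paper).
    """
    if no_id is None:
        # Softmax over the full vocabulary, then read off the "yes" entry.
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        return exps[yes_id] / sum(exps)
    # Two-way softmax restricted to the "yes"/"no" pair.
    return 1.0 / (1.0 + math.exp(logits[no_id] - logits[yes_id]))
```

In a real pipeline the logits would come from the MLLM's output at the position right after the paired-image prompt; any list of floats can stand in for them when trying the function out.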

If this is right

  • Large-scale retrieval pipelines can avoid domain-specific re-rankers by substituting an off-the-shelf MLLM for the ranking stage.
  • The same prompting method scales to new visual domains because it relies only on the model's pre-trained multimodal knowledge.
  • Memory-efficient indexing plus top-k re-ranking keeps the approach practical for large databases, because the MLLM scores only a short list of candidates per query.
  • Failure under extreme appearance variation points to a concrete limit that future prompting or adaptation strategies would need to address.
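The top-k re-ranking idea in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `rerank_top_k` and `pair_score` are hypothetical names, and `pair_score` stands in for whatever MLLM-based similarity the pipeline uses.

```python
def rerank_top_k(initial_ranking, pair_score, k):
    """Re-rank only the first k candidates of an initial ranking.

    initial_ranking: database image ids sorted by first-stage similarity.
    pair_score: callable id -> float, a stand-in for the MLLM similarity of
        (query, database image); hypothetical, supplied by the caller.
    k: number of candidates to re-score; the tail keeps its original order.
    """
    head, tail = initial_ranking[:k], initial_ranking[k:]
    # sorted() is stable, so ties keep their first-stage order.
    head = sorted(head, key=pair_score, reverse=True)
    return head + tail

# Toy usage with made-up ids and scores.
scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 1.0}
reranked = rerank_top_k(["a", "b", "c", "d"], scores.get, k=3)
```

Only the first `k` items are ever passed to `pair_score`, which is what keeps the expensive model call count independent of database size in this sketch.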

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-probability trick might transfer to other vision-only tasks such as clustering or duplicate detection without additional training.
  • If the approach generalizes, it reduces the engineering cost of maintaining separate retrieval models for each new visual domain.
  • Combining MLLM scores with existing geometric verification steps could further tighten precision on hard cases.

Load-bearing premise

Next-token probabilities from an MLLM prompted with two images directly indicate how visually similar those images are at the instance level, without needing any adaptation or fine-tuning.

What would settle it

A controlled test on a benchmark containing images with severe viewpoint or lighting changes where the MLLM re-ranker ranks correct matches lower than a standard task-specific re-ranker or a simple feature baseline.
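One way to run such a comparison is to compute average precision per query for each re-ranker and compare the values. The sketch below is a plain-Python illustration; the function name and the toy rankings are hypothetical, and the paper's own evaluation protocol (for example the mAP@1k mentioned in the Figure 4 caption) may differ in details such as truncation depth.

```python
def average_precision(ranking, positives):
    """Average precision of one ranked list given the set of correct ids."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in positives:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    # Normalize by the number of positives; 0.0 if there are none.
    return precision_sum / len(positives) if positives else 0.0

# Toy comparison (made-up ids): the same query ranked by two re-rankers.
positives = {"p1", "p2"}
ap_a = average_precision(["p1", "n1", "p2"], positives)
ap_b = average_precision(["n1", "p1", "p2"], positives)
```

Averaging these per-query values over all queries gives a mean average precision for each method; a lower value for the MLLM re-ranker on the severe-change subset would be the outcome described above.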

Figures

Figures reproduced from arXiv: 2604.13268 by Bahey Tharwat, Giorgos Kordopatis-Zilos, Giorgos Tolias, Ian Reid, Pavel Suma.

Figure 1: Performance vs. re-ranking time. MLLM-based re-rankers have higher per-image inference time than methods trained specifically for instance-level image retrieval with re-ranking (AMES), but under a fixed query-time budget, they achieve better retrieval performance, already with as few as 20 re-ranked images, indicated by the colored numbers. All reported methods have roughly the same memory footprint. from …
Figure 2: Overview of the proposed MLLM-based re-ranking approach for instance-level image retrieval. A query image, a database image, and a task-specific textual prompt are jointly used as input for similarity estimation. A vision encoder extracts visual tokens from both images (Xq, Xdb), which are concatenated with the textual prompt tokens (T) and the end-of-sentence (EOS) token to form the multimodal input to th…
Figure 3: Performance vs. memory. Memory footprint per image is approximated assuming a fixed resolution, where the longer image side is set to the target value and the shorter side is scaled to 3/4 of it, reflecting a typical aspect ratio. Five models for re-ranking, operating on top of global similarities obtained from the PE model with linear adaptation. All indexing strategies are applied on the Qwen model with …
Figure 4: Performance comparison across different amounts of object area coverage and background clutter. Positives across all queries are jointly ranked based on coverage or clutter and split into 4 equal-sized groups. mAP@1k averaged over the queries in each group. Comparison between Qwen with and without compression (Qwen-C), AMES, and PE. Robustness to scale and clutter on ILIAS. Following the original work [26…
Figure 5: Robustness analysis of MLLMs under controlled transformations. Similarity scores for positive image pairs are shown across ten types of visual perturbations: contrast, brightness, rotation, scale, background scaling, blur, tiling, noise, clutter, and occlusion. 50 queries from ILIAS are used to generate positive pairs. Dashed lines indicate the average hard-negative similarity of the 5th percentile of the …
Figure 6: Qualitative examples where one method benefits the most compared to another. We compare global (PE) and AMES vs. Qwen by showing pairs of query and positive image. → indicates the number of negative images ranked before the positive for two models, and it goes from the weaker to the stronger model for each pair. Performance on multiple datasets. In Tab. 1, we compare various re-ranking methods across mult…
Original abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes using Multimodal Large Language Models (MLLMs) as training-free similarity estimators for instance-level image-to-image retrieval. It prompts MLLMs with paired images, converts next-token probabilities into similarity scores for zero-shot re-ranking, and combines this with memory-efficient indexing for scalability. Experiments on diverse benchmarks are claimed to show MLLMs outperforming task-specific re-rankers outside native domains with superior robustness to clutter, occlusion, and small objects, while noting failure modes under severe appearance changes.

Significance. If the results hold after validation, the work would be significant as a demonstration of repurposing general-purpose MLLMs for vision-only retrieval without fine-tuning or custom architectures. The scalability approach via indexing addresses a practical barrier, and the robustness findings could influence open-world retrieval pipelines if the probability proxy is shown to be reliable.

major comments (2)
  1. [Method] Method description (probability-to-similarity conversion): No explicit formula, algorithm, or ablation is supplied showing that next-token probabilities under paired-image prompts are monotonic with instance-level visual similarity, invariant to prompt phrasing, or superior to direct visual embeddings. This step is load-bearing for all zero-shot re-ranking and robustness claims.
  2. [Experiments] Experimental evaluation: The abstract and available text supply no details on datasets, baselines, exact prompting templates, the probability conversion formula, or statistical significance tests. This prevents verification of the outperformance and robustness advantages asserted for clutter/occlusion/small-object cases.
minor comments (2)
  1. [Abstract] The abstract refers to 'diverse benchmarks' without naming them; listing the specific datasets would improve reproducibility and context.
  2. [Method] Consider adding a diagram of the paired-image prompt template and score computation pipeline to clarify the core construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's identification of areas where greater clarity is needed to support the core claims. We will revise the manuscript to incorporate explicit methodological details and expanded experimental information, while preserving the original contributions.

Point-by-point responses
  1. Referee: [Method] Method description (probability-to-similarity conversion): No explicit formula, algorithm, or ablation is supplied showing that next-token probabilities under paired-image prompts are monotonic with instance-level visual similarity, invariant to prompt phrasing, or superior to direct visual embeddings. This step is load-bearing for all zero-shot re-ranking and robustness claims.

    Authors: We agree that an explicit formulation is necessary for the probability-to-similarity conversion. In the revised manuscript we will add the precise formula (mapping the next-token probability of an affirmative response token under the paired-image prompt to a similarity score), pseudocode for the full prompting and scoring procedure, and a new ablation study. The ablation will demonstrate monotonicity with ground-truth instance similarity, invariance across prompt phrasings, and a direct comparison against the MLLM's native visual embeddings to quantify any advantage of the probability-based proxy. revision: yes

  2. Referee: [Experiments] Experimental evaluation: The abstract and available text supply no details on datasets, baselines, exact prompting templates, the probability conversion formula, or statistical significance tests. This prevents verification of the outperformance and robustness advantages asserted for clutter/occlusion/small-object cases.

    Authors: We acknowledge the need for full reproducibility details. The revised manuscript will explicitly enumerate all evaluation datasets and their characteristics, list every baseline (including task-specific re-rankers), provide the exact prompting templates used, restate the probability conversion formula, and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) on the performance differences. We will also expand the robustness analysis with quantitative breakdowns for clutter, occlusion, and small-object subsets, including failure-case examples under severe appearance change. revision: yes
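For readers unfamiliar with the paired tests mentioned in the response above, the following is a minimal sketch of a paired t statistic computed over per-query scores of two methods. It uses only the standard library and stops at the statistic and degrees of freedom; turning these into a p-value needs a t-distribution table or a statistics package. The function name and the toy numbers are hypothetical and are not results from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-query scores.

    Raises ValueError on mismatched lengths or fewer than two pairs.
    A zero standard deviation of the differences would divide by zero,
    so callers should check for identical differences beforehand.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n)), n - 1

# Made-up per-query average precision values for two re-rankers.
t, df = paired_t_statistic([0.8, 0.7, 0.9, 0.6], [0.7, 0.5, 0.8, 0.6])
```

A Wilcoxon signed-rank test would be the non-parametric alternative when the per-query differences are far from normally distributed.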

Circularity Check

0 steps flagged

No significant circularity; empirical prompting method is self-contained

full rationale

The paper describes a training-free method that prompts pre-trained MLLMs with image pairs and converts next-token probabilities into similarity scores for re-ranking. No equations, parameter fitting, derivations, or self-referential chains appear in the approach or claims. Results rest on direct experimental evaluation across external benchmarks rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for any central premise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced or mentioned in the abstract; the approach depends entirely on off-the-shelf pre-trained MLLMs and standard prompting.

pith-pipeline@v0.9.0 · 5476 in / 1113 out tokens · 58053 ms · 2026-05-10T16:02:10.584052+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS,

  2. [2]

    DivPrune: Diversity-based visual token pruning for large multimodal models

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based visual token pruning for large multimodal models. In CVPR, 2025.

  3. [3]

    NetVLAD: CNN architecture for weakly supervised place recognition

    Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.

  4. [4]

    MiniGPT4-Video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. MiniGPT4-Video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. In arXiv, 2024.

  5. [5]

    Aggregating deep convolutional features for image retrieval

    Artem Babenko and Victor Lempitsky. Aggregating deep convolutional features for image retrieval. In ICCV, 2015.

  6. [6]

    Qwen2.5-VL technical report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. In arXiv, 2025.

  7. [7]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception Encoder: The best visual embeddings are not at the output of the network. In NeurIPS, 2025.

  8. [8]

    Unifying deep local and global features for image search

    Bingyi Cao, André Araujo, and Jack Sim. Unifying deep local and global features for image search. In ECCV, 2020.

  9. [9]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.

  10. [10]

    YOLO-World: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In CVPR, 2024.

  11. [11]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  13. [13]

    The Llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. In arXiv, 2024.

  14. [14]

    Composed image retrieval for training-free domain conversion

    Nikos Efthymiadis, Bill Psomas, Zakaria Laskar, Konstantinos Karantzalos, Yannis Avrithis, Ondřej Chum, and Giorgos Tolias. Composed image retrieval for training-free domain conversion. In WACV, 2025.

  15. [15]

    FiRE: Enhancing MLLMs with fine-grained context learning for complex image retrieval

    Bohan Hou, Haoqiang Lin, Xuemeng Song, Haokun Wen, Meng Liu, Yupeng Hu, and Xiangyu Zhao. FiRE: Enhancing MLLMs with fine-grained context learning for complex image retrieval. In SIGIR, 2025.

  16. [16]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

  17. [17]

    FINECAPTION: Compositional image captioning focusing on wherever you want at any granularity

    Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. FINECAPTION: Compositional image captioning focusing on wherever you want at any granularity. In CVPR, 2025.

  18. [18]

    CoLLM: A large language model for composed image retrieval

    Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, and Abhinav Shrivastava. CoLLM: A large language model for composed image retrieval. In CVPR, 2025.

  19. [19]

    Product quantization for nearest neighbor search

    Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. PAMI, 2011.

  20. [20]

    E5-V: Universal embeddings with multimodal large language models

    Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-V: Universal embeddings with multimodal large language models. In arXiv, 2024.

  21. [21]

    VLM2Vec: Training vision-language models for massive multimodal embedding tasks

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM2Vec: Training vision-language models for massive multimodal embedding tasks. In ICLR, 2025.

  22. [22]

    BRAVE: Broadening the visual encoding of vision-language models

    Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. BRAVE: Broadening the visual encoding of vision-language models. In ECCV, 2024.

  23. [23]

    Taehoon Kim, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Mark Marsden, Alessandra Sala, Seung Hwan Kim, Bohyung Han, Kyoung Mu Lee, Honglak Lee, et al. NICE: CVPR 2023 challenge on zero-shot image captioning. In CVPR, 2024.

  24. [24]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.

  25. [25]

    DnS: Distill-and-select for efficient and accurate video indexing and retrieval

    Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, Ioannis Kompatsiaris, and Ioannis Patras. DnS: Distill-and-select for efficient and accurate video indexing and retrieval. IJCV, 2022.

  26. [26]

    ILIAS: Instance-level image retrieval at scale

    Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas, Ondřej Chum, and Giorgos Tolias. ILIAS: Instance-level image retrieval at scale. In CVPR, 2025.

  27. [27]

    Correlation verification for image retrieval

    Seongwon Lee, Hongje Seong, Suhyeon Lee, and Euntai Kim. Correlation verification for image retrieval. In CVPR,

  28. [28]

    Correlation verification for image retrieval and its memory footprint optimization

    Seongwon Lee, Hongje Seong, Suhyeon Lee, and Euntai Kim. Correlation verification for image retrieval and its memory footprint optimization. PAMI, 2024.

  29. [29]

    TokenPacker: Efficient visual projector for multimodal LLM

    Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. TokenPacker: Efficient visual projector for multimodal LLM. IJCV, 2025.

  30. [30]

    MM-Embed: Universal multimodal retrieval with multimodal LLMs

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM-Embed: Universal multimodal retrieval with multimodal LLMs. In ICLR, 2025.

  31. [31]

    IDMR: Towards instance-driven precise visual correspondence in multimodal retrieval

    Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, and Chaochao Lu. IDMR: Towards instance-driven precise visual correspondence in multimodal retrieval. In ICCVW, 2025.

  32. [32]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.

  33. [33]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,

  34. [34]

    LamRA: Large multimodal model as your advanced retrieval assistant

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. LamRA: Large multimodal model as your advanced retrieval assistant. In CVPR, 2025.

  35. [35]

    Sampling wisely: Deep image embedding by top-k precision optimization

    Jing Lu, Chaofan Xu, Wei Zhang, Ling-Yu Duan, and Tao Mei. Sampling wisely: Deep image embedding by top-k precision optimization. In ICCV, 2019.

  36. [36]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. In arXiv, 2024.

  37. [37]

    Object retrieval with large vocabularies and fast spatial matching

    James Philbin, Ondřej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.

  38. [38]

    Lost in quantization: Improving particular object retrieval in large scale image databases

    James Philbin, Ondřej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR,

  39. [39]

    Instance-level composed image retrieval

    Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondřej Chum, and Giorgos Tolias. Instance-level composed image retrieval. In NeurIPS, 2025.

  40. [40]

    Revisiting Oxford and Paris: Large-scale image retrieval benchmarking

    Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018.

  41. [41]

    Fine-tuning CNN image retrieval with no human annotation

    Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN image retrieval with no human annotation. PAMI,

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

  43. [43]

    DINO-X: A unified vision model for open-world object detection and understanding

    Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, et al. DINO-X: A unified vision model for open-world object detection and understanding. In arXiv,

  44. [44]

    Learning with average precision: Training image retrieval with a listwise loss

    Jerome Revaud, Jon Almazán, Rafael S Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In ICCV, 2019.

  45. [45]

    Facenet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.

  46. [46]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.

  47. [47]

    LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.

  48. [48]

    LLMFormer: Large language model for open-vocabulary semantic segmentation

    Hengcan Shi, Son Duy Dao, and Jianfei Cai. LLMFormer: Large language model for open-vocabulary semantic segmentation. IJCV, 2025.

  49. [49]

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. In arXiv, 2025.

  50. [50]

    Video Google: A text retrieval approach to object matching in videos

    Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV,

  51. [51]

    Open-world semantic segmentation including class similarity

    Matteo Sodano, Federico Magistri, Lucas Nunes, Jens Behley, and Cyrill Stachniss. Open-world semantic segmentation including class similarity. In CVPR, 2024.

  52. [52]

    Improved deep metric learning with multi-class n-pair loss objective

    Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016.

  53. [53]

    AMES: Asymmetric and memory-efficient similarity estimation for instance-level retrieval

    Pavel Suma, Giorgos Kordopatis-Zilos, Ahmet Iscen, and Giorgos Tolias. AMES: Asymmetric and memory-efficient similarity estimation for instance-level retrieval. In ECCV,

  54. [54]

    Elvis: Efficient visual similarity from local descriptors that generalizes across domains

    Pavel Suma, Giorgos Kordopatis-Zilos, Yannis Kalantidis, and Giorgos Tolias. Elvis: Efficient visual similarity from local descriptors that generalizes across domains. In ICLR,

  55. [55]

    Instance-level image retrieval using reranking transformers

    Fuwen Tan, Jiangbo Yuan, and Vicente Ordonez. Instance-level image retrieval using reranking transformers. In CVPR,

  56. [56]

    Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval

    Yuanmin Tang, Jue Zhang, Xiaoting Qin, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Wu. Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval. In CVPR, 2025.

  57. [57]

    Gemini: a family of highly capable multimodal models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. In arXiv, 2023.

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

  59. [59]

    Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. In arXiv, 2024.

  60. [60]

    INSTRE: A new benchmark for instance-level object retrieval and recognition

    Shuang Wang and Shuqiang Jiang. INSTRE: A new benchmark for instance-level object retrieval and recognition. TOMM, 2015.

  61. [61]

    InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. In arXiv, 2025.

  62. [62]

    UniIR: Training and benchmarking universal multimodal information retrievers

    Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. UniIR: Training and benchmarking universal multimodal information retrievers. In ECCV, 2024.

  63. [63]

    Google Landmarks Dataset v2 – A large-scale benchmark for instance-level recognition and retrieval

    Tobias Weyand, André Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2 – A large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020.

  64. [64]

    LOCORE: Image re-ranking with long-context sequence modeling

    Zilin Xiao, Pavel Suma, Ayush Sachdeva, Hao-Jen Wang, Giorgos Kordopatis-Zilos, Giorgos Tolias, and Vicente Ordonez. LOCORE: Image re-ranking with long-context sequence modeling. In CVPR, 2025.

  65. [65]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. In arXiv,

  66. [66]

    DetCLIPv3: Towards versatile generative open-vocabulary object detection

    Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. DetCLIPv3: Towards versatile generative open-vocabulary object detection. In CVPR, 2024.

  67. [67]

    Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining

    Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In ICCV, 2021.

  68. [68]

    Bridging modalities: Improving universal multimodal retrieval by multimodal large language models

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Bridging modalities: Improving universal multimodal retrieval by multimodal large language models. In CVPR, 2025.

  69. [69]

    Qwen3 embedding: Advancing text embedding and reranking through foundation models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. In arXiv, 2025.

  70. [70]

    R2former: Unified retrieval and reranking transformer for place recognition

    Sijie Zhu, Linjie Yang, Chen Chen, Mubarak Shah, Xiaohui Shen, and Heng Wang. R2former: Unified retrieval and reranking transformer for place recognition. In CVPR, 2023.