Indexing Multimodal Language Models for Large-scale Image Retrieval

Bahey Tharwat; Giorgos Kordopatis-Zilos; Giorgos Tolias; Ian Reid; Pavel Suma

arXiv: 2604.13268 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.CL · cs.IR

Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.IR

keywords multimodal large language models · image retrieval · zero-shot re-ranking · similarity estimation · large-scale indexing · vision-language models

The pith

Multimodal large language models can act as zero-shot similarity estimators for large-scale image retrieval by scoring image pairs through next-token probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether multimodal large language models can estimate visual similarity between images without training or fine-tuning. It prompts the model with two images at once and turns the probability assigned to the next token into a similarity score, then applies this score to re-rank the top candidates from an initial retrieval system. This relies on the visual knowledge the model already acquired during multimodal pre-training. Experiments on several benchmarks show that the approach outperforms task-specific re-rankers outside the domains those re-rankers were built for, and that it handles clutter, occlusion, and small objects better than the alternatives. The work also reports specific failure cases when objects undergo large appearance shifts.

Core claim

Prompting an MLLM with a pair of images and converting the next-token probabilities into a similarity score produces an effective training-free re-ranker that, when combined with memory-efficient indexing, improves instance-level image retrieval performance across diverse benchmarks and remains robust to clutter, occlusion, and small objects.

What carries the argument

Prompting an MLLM with paired images and mapping next-token probabilities to similarity scores for zero-shot re-ranking of indexed candidates.
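As a rough illustration of the scoring step, the sketch below turns a vector of next-token logits into a similarity score. The exact formula is not given on this page, so everything specific here is an assumption made for illustration: a yes/no style prompt, the renormalization over two answer tokens, the function name `pair_similarity`, and the toy token ids.

```python
import math


def pair_similarity(logits, yes_id, no_id):
    """Map next-token logits to a similarity score in [0, 1].

    Assumption (not taken from the paper): the prompt asks a yes/no question
    about the image pair, and the score is P(yes) / (P(yes) + P(no)).
    """
    # Numerically stable softmax over the full vocabulary.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    p_yes = exps[yes_id] / total
    p_no = exps[no_id] / total
    # Renormalize over the two answer tokens only.
    return p_yes / (p_yes + p_no)


# Toy 5-token vocabulary; ids 1 and 3 stand in for "Yes" and "No".
toy_logits = [0.1, 2.0, -1.0, 0.5, 0.0]
score = pair_similarity(toy_logits, yes_id=1, no_id=3)
```

With this normalization the score depends only on the gap between the two answer logits, so it can be compared across image pairs even when the model spreads probability mass over other tokens.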

If this is right

Large-scale retrieval pipelines can avoid domain-specific re-rankers by substituting an off-the-shelf MLLM for the ranking stage.
The same prompting method scales to new visual domains because it relies only on the model's pre-trained multimodal knowledge.
Memory-efficient indexing plus top-k re-ranking keeps the approach practical even when the database holds millions of images, because the MLLM only scores the shortlisted candidates.
Failure under extreme appearance variation points to a concrete limit that future prompting or adaptation strategies would need to address.
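The two-stage pipeline implied by the points above can be sketched as follows: a cheap global-descriptor search produces a shortlist, and only that shortlist is passed to the expensive pairwise scorer. This is a minimal sketch under stated assumptions, not the authors' implementation; `retrieve_and_rerank`, the toy vectors, and the `fake_scores` table standing in for MLLM outputs are all hypothetical.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def retrieve_and_rerank(query_vec, db_vecs, pair_scorer, k):
    """Shortlist by global similarity, then re-rank only the top-k."""
    # Stage 1: rank the whole database with cheap global descriptors.
    ranked = sorted(range(len(db_vecs)),
                    key=lambda i: cosine(query_vec, db_vecs[i]),
                    reverse=True)
    shortlist, tail = ranked[:k], ranked[k:]
    # Stage 2: call the expensive pairwise scorer on the shortlist only.
    reranked = sorted(shortlist, key=pair_scorer, reverse=True)
    # Images outside the shortlist keep their stage-1 order.
    return reranked + tail


db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
# Hypothetical pairwise scores, standing in for MLLM outputs per database id.
fake_scores = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}
order = retrieve_and_rerank(query, db, fake_scores.get, k=2)
```

Note that database image 2 has the highest pairwise score in the toy table but never reaches the scorer, because it falls outside the stage-1 shortlist. The re-ranker can only reorder what the first stage retrieves.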

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same token-probability trick might transfer to other vision-only tasks such as clustering or duplicate detection without additional training.
If the approach generalizes, it reduces the engineering cost of maintaining separate retrieval models for each new visual domain.
Combining MLLM scores with existing geometric verification steps could further tighten precision on hard cases.

Load-bearing premise

Next-token probabilities from an MLLM prompted with two images directly indicate how visually similar those images are at the instance level, without needing any adaptation or fine-tuning.

What would settle it

A controlled test on a benchmark containing images with severe viewpoint or lighting changes where the MLLM re-ranker ranks correct matches lower than a standard task-specific re-ranker or a simple feature baseline.
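Such a comparison could be scored per query with average precision, the quantity behind the mAP figures the paper reports. The sketch below is illustrative only: the rankings, ids, and variable names are made up, and it assumes every positive appears somewhere in the ranked list.

```python
def average_precision(ranked_ids, positives):
    """AP of one ranked list given the set of relevant ids."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in positives:
            hits += 1
            # Precision at the rank where this positive was found.
            precision_sum += hits / rank
    return precision_sum / len(positives) if positives else 0.0


# Hypothetical query with two positives ("a", "b") and two distractors.
positives = {"a", "b"}
baseline_ranking = ["a", "x", "b", "y"]
mllm_ranking = ["x", "a", "y", "b"]
ap_baseline = average_precision(baseline_ranking, positives)
ap_mllm = average_precision(mllm_ranking, positives)
```

In this toy case the second ranking places both positives one position lower and its AP drops accordingly. A consistent gap of that kind on severe viewpoint or lighting changes is the outcome the test above is looking for.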

Figures

Figures reproduced from arXiv: 2604.13268 by Bahey Tharwat, Giorgos Kordopatis-Zilos, Giorgos Tolias, Ian Reid, Pavel Suma.

**Figure 1.** Performance vs. re-ranking time. MLLM-based re-rankers have higher per-image inference time than methods trained specifically for instance-level image retrieval with re-ranking (AMES), but under a fixed query-time budget, they achieve better retrieval performance, already with as few as 20 re-ranked images, indicated by the colored numbers. All reported methods have roughly the same memory footprint. from …

**Figure 2.** Overview of the proposed MLLM-based re-ranking approach for instance-level image retrieval. A query image, a database image, and a task-specific textual prompt are jointly used as input for similarity estimation. A vision encoder extracts visual tokens from both images (Xq, Xdb), which are concatenated with the textual prompt tokens (T) and the end-of-sentence (EOS) token to form the multimodal input to th…

**Figure 3.** Performance vs. memory. Memory footprint per image is approximated assuming a fixed resolution, where the longer image side is set to the target value and the shorter side is scaled to 3/4 of it, reflecting a typical aspect ratio. Five models for re-ranking, operating on top of global similarities obtained from the PE model with linear adaptation. All indexing strategies are applied on the Qwen model with …

**Figure 4.** Performance comparison across different amounts of object area coverage and background clutter. Positives across all queries are jointly ranked based on coverage or clutter and split into 4 equal-sized groups. mAP@1k averaged over the queries in each group. Comparison between Qwen with and without compression (Qwen-C), AMES, and PE. Robustness to scale and clutter on ILIAS. Following the original work [26…

**Figure 5.** Robustness analysis of MLLMs under controlled transformations. Similarity scores for positive image pairs are shown across ten types of visual perturbations: contrast, brightness, rotation, scale, background scaling, blur, tiling, noise, clutter, and occlusion. 50 queries from ILIAS are used to generate positive pairs. Dashed lines indicate the average hard-negative similarity of the 5th percentile of the …

**Figure 6.** Qualitative examples where one method benefits the most compared to another. We compare global (PE) and AMES vs. Qwen by showing pairs of query and positive image. → indicates the number of negative images ranked before the positive for two models, and it goes from the weaker to the stronger model for each pair. Performance on multiple datasets. In Tab. 1, we compare various re-ranking methods across mult…

Original abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MLLMs can be prompted with image pairs to produce zero-shot similarity scores for retrieval re-ranking, but the abstract gives no method details or numbers so the claims stay unverified.

The punchline is that this paper takes existing multimodal LLMs and uses them as training-free image similarity estimators: it feeds in paired images and turns next-token probabilities into scores for re-ranking the top-k candidates from a fast index. That framing for large-scale vision-only retrieval is the main new angle, distinct from prior work on task-specific re-rankers or fine-tuned models. The authors also note robustness gains on clutter, occlusion, and small objects, plus failure modes under large appearance changes, which keeps the write-up from overclaiming. The approach reuses pre-trained models without new architectures or adaptation, a practical plus if it holds up.

The soft spots are clear from the abstract alone: no datasets, no baselines, no exact formula for converting probabilities to scores, and no ablations on prompt sensitivity or statistical significance. Without those, it is impossible to tell whether the probabilities actually track instance-level visual similarity or just pick up coarser semantic signals. The stress-test concern lands here: the proxy needs explicit checks for monotonicity with visual distance before the robustness claims can be trusted.

This is for retrieval researchers who already work with large models and want zero-shot options instead of training dedicated networks. A reader focused on practical multimodal reuse would get the most out of it. The paper deserves a serious referee because the core idea is straightforward and timely, even though the current evidence is thin. Send it to review so the full experiments, prompting details, and comparisons can be examined properly.

Referee Report

2 major / 2 minor

Summary. The paper proposes using Multimodal Large Language Models (MLLMs) as training-free similarity estimators for instance-level image-to-image retrieval. It prompts MLLMs with paired images, converts next-token probabilities into similarity scores for zero-shot re-ranking, and combines this with memory-efficient indexing for scalability. Experiments on diverse benchmarks are claimed to show MLLMs outperforming task-specific re-rankers outside native domains with superior robustness to clutter, occlusion, and small objects, while noting failure modes under severe appearance changes.

Significance. If the results hold after validation, the work would be significant as a demonstration of repurposing general-purpose MLLMs for vision-only retrieval without fine-tuning or custom architectures. The scalability approach via indexing addresses a practical barrier, and the robustness findings could influence open-world retrieval pipelines if the probability proxy is shown to be reliable.

major comments (2)

[Method] Method description (probability-to-similarity conversion): No explicit formula, algorithm, or ablation is supplied showing that next-token probabilities under paired-image prompts are monotonic with instance-level visual similarity, invariant to prompt phrasing, or superior to direct visual embeddings. This step is load-bearing for all zero-shot re-ranking and robustness claims.
[Experiments] Experimental evaluation: The abstract and available text supply no details on datasets, baselines, exact prompting templates, the probability conversion formula, or statistical significance tests. This prevents verification of the outperformance and robustness advantages asserted for clutter/occlusion/small-object cases.
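One way to probe the monotonicity concern raised in the first major comment is a rank correlation between perturbation strength and the score assigned to a known positive pair. The sketch below computes Spearman's coefficient for tie-free data. The data values and function names are hypothetical, and this is one possible check rather than an analysis taken from the paper.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    # Sum of squared rank differences.
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical data: perturbation strength vs. score of a positive pair.
strength = [0.0, 0.2, 0.4, 0.6, 0.8]
scores = [0.95, 0.90, 0.70, 0.72, 0.30]
rho = spearman(strength, scores)
```

A coefficient close to -1 would indicate that the score falls steadily as the perturbation grows. Values near zero would suggest the score is not tracking visual distance for that transformation.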

minor comments (2)

[Abstract] The abstract refers to 'diverse benchmarks' without naming them; listing the specific datasets would improve reproducibility and context.
[Method] Consider adding a diagram of the paired-image prompt template and score computation pipeline to clarify the core construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's identification of areas where greater clarity is needed to support the core claims. We will revise the manuscript to incorporate explicit methodological details and expanded experimental information, while preserving the original contributions.

Referee: [Method] Method description (probability-to-similarity conversion): No explicit formula, algorithm, or ablation is supplied showing that next-token probabilities under paired-image prompts are monotonic with instance-level visual similarity, invariant to prompt phrasing, or superior to direct visual embeddings. This step is load-bearing for all zero-shot re-ranking and robustness claims.

Authors: We agree that an explicit formulation is necessary for the probability-to-similarity conversion. In the revised manuscript we will add the precise formula (mapping the next-token probability of an affirmative response token under the paired-image prompt to a similarity score), pseudocode for the full prompting and scoring procedure, and a new ablation study. The ablation will demonstrate monotonicity with ground-truth instance similarity, invariance across prompt phrasings, and a direct comparison against the MLLM's native visual embeddings to quantify any advantage of the probability-based proxy.

Revision: yes

Referee: [Experiments] Experimental evaluation: The abstract and available text supply no details on datasets, baselines, exact prompting templates, the probability conversion formula, or statistical significance tests. This prevents verification of the outperformance and robustness advantages asserted for clutter/occlusion/small-object cases.

Authors: We acknowledge the need for full reproducibility details. The revised manuscript will explicitly enumerate all evaluation datasets and their characteristics, list every baseline (including task-specific re-rankers), provide the exact prompting templates used, restate the probability conversion formula, and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) on the performance differences. We will also expand the robustness analysis with quantitative breakdowns for clutter, occlusion, and small-object subsets, including failure-case examples under severe appearance change.

Revision: yes
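For the paired significance tests mentioned in the response above, a minimal sketch of the paired t statistic over per-query AP values might look like this. The AP values and the helper name `paired_t_statistic` are hypothetical. Turning the statistic into a p-value requires the t distribution with the returned degrees of freedom, which a library routine such as `scipy.stats.ttest_rel` handles directly.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    # Mean difference divided by its standard error.
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical per-query AP values for two re-rankers on the same queries.
ap_method_a = [0.80, 0.65, 0.90, 0.70, 0.75]
ap_method_b = [0.70, 0.60, 0.85, 0.72, 0.65]
t_stat, dof = paired_t_statistic(ap_method_a, ap_method_b)
```

Pairing by query matters here: both methods are evaluated on the same queries, so the test is run on per-query differences rather than on two independent samples.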

Circularity Check

0 steps flagged

No significant circularity; empirical prompting method is self-contained

The paper describes a training-free method that prompts pre-trained MLLMs with image pairs and converts next-token probabilities into similarity scores for re-ranking. No equations, parameter fitting, derivations, or self-referential chains appear in the approach or claims. Results rest on direct experimental evaluation across external benchmarks rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for any central premise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced or mentioned in the abstract; the approach depends entirely on off-the-shelf pre-trained MLLMs and standard prompting.

pith-pipeline@v0.9.0 · 5476 in / 1113 out tokens · 58053 ms · 2026-05-10T16:02:10.584052+00:00 · methodology

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS.
[2] Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based visual token pruning for large multimodal models. In CVPR, 2025.
[3] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
[4] Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. MiniGPT4-Video: Advancing multimodal LLMs for video understanding with interleaved visual-textual tokens. In arXiv, 2024.
[5] Artem Babenko and Victor Lempitsky. Aggregating deep convolutional features for image retrieval. In ICCV, 2015.
[6] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. In arXiv, 2025.
[7] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception Encoder: The best visual embeddings are not at the output of the network. In NeurIPS, 2025.
[8] Bingyi Cao, André Araujo, and Jack Sim. Unifying deep local and global features for image search. In ECCV, 2020.
[9] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.
[10] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In CVPR, 2024.
[11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICML, 2021.
[13] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. In arXiv, 2024.
[14] Nikos Efthymiadis, Bill Psomas, Zakaria Laskar, Konstantinos Karantzalos, Yannis Avrithis, Ondřej Chum, and Giorgos Tolias. Composed image retrieval for training-free domain conversion. In WACV, 2025.
[15] Bohan Hou, Haoqiang Lin, Xuemeng Song, Haokun Wen, Meng Liu, Yupeng Hu, and Xiangyu Zhao. FiRE: Enhancing MLLMs with fine-grained context learning for complex image retrieval. In SIGIR, 2025.
[16] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
[17] Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. FINECAPTION: Compositional image captioning focusing on wherever you want at any granularity. In CVPR, 2025.
[18] Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, and Abhinav Shrivastava. CoLLM: A large language model for composed image retrieval. In CVPR, 2025.
[19] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. PAMI, 2011.
[20] Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-V: Universal embeddings with multimodal large language models. In arXiv, 2024.
[21] Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM2Vec: Training vision-language models for massive multimodal embedding tasks. In ICLR, 2025.
[22] Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. BRAVE: Broadening the visual encoding of vision-language models. In ECCV, 2024.
[23] Taehoon Kim, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Mark Marsden, Alessandra Sala, Seung Hwan Kim, Bohyung Han, Kyoung Mu Lee, Honglak Lee, et al. NICE: CVPR 2023 challenge on zero-shot image captioning. In CVPR, 2024.
[24] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
[25] Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, Ioannis Kompatsiaris, and Ioannis Patras. DnS: Distill-and-select for efficient and accurate video indexing and retrieval. IJCV, 2022.
[26] Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas, Ondřej Chum, and Giorgos Tolias. ILIAS: Instance-level image retrieval at scale. In CVPR, 2025.
[27] Seongwon Lee, Hongje Seong, Suhyeon Lee, and Euntai Kim. Correlation verification for image retrieval. In CVPR.
[28] Seongwon Lee, Hongje Seong, Suhyeon Lee, and Euntai Kim. Correlation verification for image retrieval and its memory footprint optimization. PAMI, 2024.
[29] Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. TokenPacker: Efficient visual projector for multimodal LLM. IJCV, 2025.
[30] Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM-Embed: Universal multimodal retrieval with multimodal LLMs. In ICLR, 2025.
[31] Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, and Chaochao Lu. IDMR: Towards instance-driven precise visual correspondence in multimodal retrieval. In ICCVW, 2025.
[32] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[33] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR.
[34] Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. LamRA: Large multimodal model as your advanced retrieval assistant. In CVPR, 2025.
[35] Jing Lu, Chaofan Xu, Wei Zhang, Ling-Yu Duan, and Tao Mei. Sampling wisely: Deep image embedding by top-k precision optimization. In ICCV, 2019.
[36] OpenAI. GPT-4o system card. In arXiv, 2024.
[37] James Philbin, Ondřej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[38] James Philbin, Ondřej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR.
[39] Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, and Giorgos Tolias. Instance-level composed image retrieval. In NeurIPS, 2025.
[40] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
[41] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN image retrieval with no human annotation. PAMI.
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
[43] Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, et al. DINO-X: A unified vision model for open-world object detection and understanding. In arXiv.
[44] Jerome Revaud, Jon Almazán, Rafael S Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In ICCV, 2019.
[45] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[46] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[47] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.
[48] Hengcan Shi, Son Duy Dao, and Jianfei Cai. LLMFormer: Large language model for open-vocabulary semantic segmentation. IJCV, 2025.
[49] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. In arXiv, 2025.
[50] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV.
[51] Matteo Sodano, Federico Magistri, Lucas Nunes, Jens Behley, and Cyrill Stachniss. Open-world semantic segmentation including class similarity. In CVPR, 2024.
[52] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016.
[53] Pavel Suma, Giorgos Kordopatis-Zilos, Ahmet Iscen, and Giorgos Tolias. AMES: Asymmetric and memory-efficient similarity estimation for instance-level retrieval. In ECCV.
[54] Pavel Suma, Giorgos Kordopatis-Zilos, Yannis Kalantidis, and Giorgos Tolias. Elvis: Efficient visual similarity from local descriptors that generalizes across domains. In ICLR.
[55] Fuwen Tan, Jiangbo Yuan, and Vicente Ordonez. Instance-level image retrieval using reranking transformers. In CVPR.
[56] Yuanmin Tang, Jue Zhang, Xiaoting Qin, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Wu. Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval. In CVPR, 2025.
[57] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. In arXiv, 2023.
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[59] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. In arXiv, 2024.
[60] Shuang Wang and Shuqiang Jiang. INSTRE: A new benchmark for instance-level object retrieval and recognition. TOMM, 2015.
[61]

InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. InarXiv, 2025. 2, 5

[62] Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. UniIR: Training and benchmarking universal multimodal information retrievers. In ECCV, 2024.

[63] Tobias Weyand, André Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2 – A large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020.

[64] Zilin Xiao, Pavel Suma, Ayush Sachdeva, Hao-Jen Wang, Giorgos Kordopatis-Zilos, Giorgos Tolias, and Vicente Ordonez. LOCORE: Image re-ranking with long-context sequence modeling. In CVPR, 2025.

[65] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. In arXiv,

[66] Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. DetCLIPv3: Towards versatile generative open-vocabulary object detection. In CVPR, 2024.

[67] Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In ICCV, 2021.

[68] Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Bridging modalities: Improving universal multimodal retrieval by multimodal large language models. In CVPR, 2025.

[69] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. In arXiv, 2025.

[70] Sijie Zhu, Linjie Yang, Chen Chen, Mubarak Shah, Xiaohui Shen, and Heng Wang. R2former: Unified retrieval and reranking transformer for place recognition. In CVPR, 2023.

