Indexing Multimodal Language Models for Large-scale Image Retrieval
Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3
The pith
Multimodal large language models can act as zero-shot similarity estimators for large-scale image retrieval by scoring image pairs through next-token probabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompting an MLLM with a pair of images and converting the next-token probabilities into a similarity score produces an effective training-free re-ranker that, when combined with memory-efficient indexing, improves instance-level image retrieval performance across diverse benchmarks and remains robust to clutter, occlusion, and small objects.
What carries the argument
Prompting an MLLM with paired images and mapping next-token probabilities to similarity scores for zero-shot re-ranking of indexed candidates.
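The text available here does not give the paper's exact conversion formula. A minimal sketch, assuming the score is the probability of an affirmative answer token renormalized against a negative one, could look like the following. The function name, the token ids, and the two-way renormalization are illustrative assumptions, not the authors' stated method.

```python
import math

def affirmative_similarity(logits, yes_id, no_id):
    """Map next-token logits to a similarity in [0, 1].

    Assumption: similarity is the probability of the affirmative token,
    renormalized over the {yes, no} pair only.
    """
    # Two-way softmax restricted to the two answer tokens, computed stably.
    m = max(logits[yes_id], logits[no_id])
    p_yes = math.exp(logits[yes_id] - m)
    p_no = math.exp(logits[no_id] - m)
    return p_yes / (p_yes + p_no)

# Toy vocabulary: index 0 = "Yes", index 1 = "No", index 2 = any other token.
toy_logits = [2.0, 0.0, -1.0]
score = affirmative_similarity(toy_logits, yes_id=0, no_id=1)
```

Restricting the softmax to the two answer tokens keeps the score comparable across image pairs even when the model spreads probability mass over unrelated tokens. Whether the paper normalizes this way is not stated in the material above.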
If this is right
- Large-scale retrieval pipelines can avoid domain-specific re-rankers by substituting an off-the-shelf MLLM for the ranking stage.
- The same prompting method scales to new visual domains because it relies only on the model's pre-trained multimodal knowledge.
- Memory-efficient indexing plus top-k re-ranking keeps the approach practical even when the candidate pool contains millions of images.
- Failure under extreme appearance variation points to a concrete limit that future prompting or adaptation strategies would need to address.
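The two-stage pipeline implied by the bullets above can be sketched as follows. This is a hedged illustration: a plain dot product stands in for whatever compressed index the paper uses, and `pair_score` is a hypothetical stand-in for the MLLM pairwise score.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_and_rerank(query_vec, db_vecs, pair_score, k):
    """Rank database ids: cheap global similarity first, costly scorer on top-k only."""
    # Stage 1: rank the whole database by a cheap embedding similarity.
    initial = sorted(range(len(db_vecs)),
                     key=lambda i: dot(query_vec, db_vecs[i]),
                     reverse=True)
    # Stage 2: re-score only the k best candidates with the expensive pairwise scorer.
    head = sorted(initial[:k], key=pair_score, reverse=True)
    # Candidates beyond the top-k keep their stage-1 order.
    return head + initial[k:]

db = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]
query = [1.0, 0.0]
# Hypothetical pairwise scores keyed by database id (not real model output).
fake_mllm = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}
ranking = retrieve_and_rerank(query, db, lambda i: fake_mllm[i], k=2)
```

Only `k` pairwise calls are made per query, so the expensive model's cost is independent of database size. The trade-off is visible in the toy data: item 2 would score highest under the pairwise scorer but is never re-scored because it falls outside the shortlist.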
Where Pith is reading between the lines
- The same token-probability trick might transfer to other vision-only tasks such as clustering or duplicate detection without additional training.
- If the approach generalizes, it reduces the engineering cost of maintaining separate retrieval models for each new visual domain.
- Combining MLLM scores with existing geometric verification steps could further tighten precision on hard cases.
Load-bearing premise
Next-token probabilities from an MLLM prompted with two images directly indicate how visually similar those images are at the instance level, without needing any adaptation or fine-tuning.
What would settle it
A controlled test on a benchmark with severe viewpoint or lighting changes in which the MLLM re-ranker ranks correct matches lower than a standard task-specific re-ranker or a simple feature baseline.
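One way to quantify "ranks correct matches lower" is average precision per query. The sketch below assumes that metric; the material above does not say which metric the paper reports, and the item ids are made up for illustration.

```python
def average_precision(ranking, positives):
    """AP of one ranked list given the set of correct database ids."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in positives:
            hits += 1
            # Precision at this rank, accumulated at each correct match.
            total += hits / rank
    return total / len(positives) if positives else 0.0

positives = {"a", "b"}
# Hypothetical rankings from two re-rankers for the same query.
ap_mllm = average_precision(["x", "a", "y", "b"], positives)
ap_baseline = average_precision(["a", "b", "x", "y"], positives)
```

In this toy case the baseline places both correct matches first and gets a higher AP than the hypothetical MLLM ranking, which is the pattern the proposed test would look for across a whole benchmark.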
Original abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using Multimodal Large Language Models (MLLMs) as training-free similarity estimators for instance-level image-to-image retrieval. It prompts MLLMs with paired images, converts next-token probabilities into similarity scores for zero-shot re-ranking, and combines this with memory-efficient indexing for scalability. Experiments on diverse benchmarks are claimed to show MLLMs outperforming task-specific re-rankers outside native domains with superior robustness to clutter, occlusion, and small objects, while noting failure modes under severe appearance changes.
Significance. If the results hold after validation, the work would be significant as a demonstration of repurposing general-purpose MLLMs for vision-only retrieval without fine-tuning or custom architectures. The scalability approach via indexing addresses a practical barrier, and the robustness findings could influence open-world retrieval pipelines if the probability proxy is shown to be reliable.
major comments (2)
- [Method] Method description (probability-to-similarity conversion): No explicit formula, algorithm, or ablation is supplied showing that next-token probabilities under paired-image prompts are monotonic with instance-level visual similarity, invariant to prompt phrasing, or superior to direct visual embeddings. This step is load-bearing for all zero-shot re-ranking and robustness claims.
- [Experiments] Experimental evaluation: The abstract and available text supply no details on datasets, baselines, exact prompting templates, the probability conversion formula, or statistical significance tests. This prevents verification of the outperformance and robustness advantages asserted for clutter/occlusion/small-object cases.
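The monotonicity concern in the first major comment could be checked with a rank correlation between the model's scores and graded ground-truth similarity. The sketch below uses Spearman's coefficient as one reasonable choice; the metric, the function names, and the toy numbers are assumptions, not something the paper or the report specifies.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            out[order[t]] = mean_rank
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy data: graded ground-truth similarity vs. hypothetical MLLM scores.
truth = [0, 1, 2, 3, 4]
scores = [0.05, 0.10, 0.40, 0.35, 0.90]
rho = spearman(truth, scores)
```

A coefficient near 1 would support monotonicity. Running the same check on the scores produced by differently phrased prompts would give a rough handle on prompt invariance. The function divides by zero if either input is constant, so degenerate inputs need separate handling.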
minor comments (2)
- [Abstract] The abstract refers to 'diverse benchmarks' without naming them; listing the specific datasets would improve reproducibility and context.
- [Method] Consider adding a diagram of the paired-image prompt template and score computation pipeline to clarify the core construction.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's identification of areas where greater clarity is needed to support the core claims. We will revise the manuscript to incorporate explicit methodological details and expanded experimental information, while preserving the original contributions.
Point-by-point responses
- Referee: [Method] Method description (probability-to-similarity conversion): No explicit formula, algorithm, or ablation is supplied showing that next-token probabilities under paired-image prompts are monotonic with instance-level visual similarity, invariant to prompt phrasing, or superior to direct visual embeddings. This step is load-bearing for all zero-shot re-ranking and robustness claims.
  Authors: We agree that an explicit formulation is necessary for the probability-to-similarity conversion. In the revised manuscript we will add the precise formula (mapping the next-token probability of an affirmative response token under the paired-image prompt to a similarity score), pseudocode for the full prompting and scoring procedure, and a new ablation study. The ablation will demonstrate monotonicity with ground-truth instance similarity, invariance across prompt phrasings, and a direct comparison against the MLLM's native visual embeddings to quantify any advantage of the probability-based proxy.
  Revision: yes
- Referee: [Experiments] Experimental evaluation: The abstract and available text supply no details on datasets, baselines, exact prompting templates, the probability conversion formula, or statistical significance tests. This prevents verification of the outperformance and robustness advantages asserted for clutter/occlusion/small-object cases.
  Authors: We acknowledge the need for full reproducibility details. The revised manuscript will explicitly enumerate all evaluation datasets and their characteristics, list every baseline (including task-specific re-rankers), provide the exact prompting templates used, restate the probability conversion formula, and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) on the performance differences. We will also expand the robustness analysis with quantitative breakdowns for clutter, occlusion, and small-object subsets, including failure-case examples under severe appearance change.
  Revision: yes
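The paired significance testing promised above can be illustrated with a small sketch. Note the substitution: instead of the paired t-test or Wilcoxon test the authors name, this uses an exact paired sign-flip permutation test, a nonparametric test in the same spirit that needs only the standard library. The per-query values are hypothetical.

```python
from itertools import product

def paired_permutation_pvalue(a, b):
    """Two-sided exact sign-flip test on per-query differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under the null, each difference is equally likely to have either sign.
    # Exhaustive enumeration costs 2**n, so this is only for small n.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            count += 1
    return count / total

# Hypothetical per-query AP values for two re-rankers on the same 6 queries.
ap_a = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8]
ap_b = [0.7, 0.6, 0.6, 0.7, 0.5, 0.6]
p_value = paired_permutation_pvalue(ap_a, ap_b)
```

For benchmarks with more than a few dozen queries, exhaustive enumeration is impractical; a random sample of sign assignments, or a library implementation of the tests the authors mention, would be the usual route.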
Circularity Check
No significant circularity; empirical prompting method is self-contained
The paper describes a training-free method that prompts pre-trained MLLMs with image pairs and converts next-token probabilities into similarity scores for re-ranking. No equations, parameter fitting, derivations, or self-referential chains appear in the approach or claims. Results rest on direct experimental evaluation across external benchmarks rather than any reduction of outputs to inputs by construction. Self-citations, if present, are not load-bearing for any central premise.