RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
Pith reviewed 2026-05-24 03:29 UTC · model grok-4.3
The pith
CLIP-based retrieval of top-k candidates followed by MLLM ranking improves accuracy on fine-grained and few-shot visual tasks with large category sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAR creates explicit category memory outside the context window with a CLIP multi-modal retriever, retrieves the top-k similar items at inference time, and lets the MLLM rank those candidates to reach a final prediction; this produces significant gains on five fine-grained visual recognition benchmarks, eleven few-shot image recognition datasets, and two object detection datasets under zero-shot recognition.
What carries the argument
CLIP multi-modal retriever that stores and queries external category memory, followed by MLLM ranking of the retrieved top-k items to select the output label.
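A minimal sketch of the retrieve-then-rank flow described above, under stated assumptions. It is not the paper's implementation: the vectors stand in for CLIP embeddings, the memory holds one averaged prototype per class (following the averaging mentioned in the simulated rebuttal), and `ranker` is a hypothetical callable standing in for the MLLM ranking call. All function names are illustrative.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def build_memory(samples: Sequence[Tuple[str, Vector]]) -> Dict[str, List[float]]:
    """Average the normalized embeddings of each class into one prototype."""
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for label, emb in samples:
        grouped[label].append(normalize(emb))
    memory: Dict[str, List[float]] = {}
    for label, embs in grouped.items():
        mean = [sum(col) / len(embs) for col in zip(*embs)]
        memory[label] = normalize(mean)
    return memory


def retrieve_top_k(query: Vector, memory: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k labels whose prototypes are most similar to the query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, proto)), label)
        for label, proto in memory.items()
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:k]]


def predict(
    query: Vector,
    memory: Dict[str, List[float]],
    k: int,
    ranker: Callable[[List[str]], List[str]],
) -> str:
    """Retrieve k candidates, let the ranker reorder them, return the top label."""
    candidates = retrieve_top_k(query, memory, k)
    for label in ranker(candidates):
        if label in candidates:  # ignore labels the ranker invented
            return label
    return candidates[0]  # fall back to the retriever's first choice


# Toy 3-dimensional "embeddings"; real CLIP features are much larger.
samples = [
    ("sparrow", [1.0, 0.1, 0.0]),
    ("sparrow", [0.9, 0.2, 0.0]),
    ("finch", [0.8, 0.6, 0.0]),
    ("truck", [0.0, 0.0, 1.0]),
]
memory = build_memory(samples)
query = [0.85, 0.5, 0.05]
top2 = retrieve_top_k(query, memory, k=2)
```

The ranker only ever sees `k` labels, so the prompt length depends on `k` rather than on the size of the category vocabulary. A production system would likely replace the linear scan with an approximate nearest-neighbor index.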
If this is right
- Accuracy rises on the five reported fine-grained visual recognition benchmarks.
- Performance improves across the eleven few-shot image recognition datasets.
- Zero-shot object detection accuracy increases on the two evaluated datasets.
- The method maintains the MLLM's broad pre-trained knowledge while mitigating its fine-grained limitations at scale.
- The external-memory design allows handling category vocabularies larger than the model's context window permits.
Where Pith is reading between the lines
- The same retrieve-then-rank pattern could be tested on visual question answering or captioning tasks that also face large output spaces.
- Replacing the CLIP retriever with a stronger or task-specific one would be a direct way to reduce failure cases where the correct label is missed in the top-k.
- The separation of long-term memory from the language model's context window suggests a reusable template for other classification problems that outgrow single-pass context limits.
Load-bearing premise
The correct category will appear among the top-k items returned by the CLIP retriever so that the subsequent ranking step can still produce an accurate prediction even when the total number of categories is large.
What would settle it
Measure accuracy on a test set where the ground-truth label is forced to be absent from the CLIP top-k retrieval results; if accuracy falls to near-chance levels the central claim is falsified.
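The measurement above can be sketched as follows, assuming per-query candidate lists and ground-truth labels are available. The helper names are hypothetical. One point is worth making explicit: for a ranker that can only return one of the retrieved candidates, accuracy with the ground truth removed is zero by construction, and overall accuracy can never exceed recall@k, so recall@k itself is the quantity to report.

```python
from typing import List, Sequence


def recall_at_k(candidates: Sequence[Sequence[str]], truths: Sequence[str]) -> float:
    """Fraction of queries whose ground-truth label appears among the candidates."""
    hits = sum(1 for cand, truth in zip(candidates, truths) if truth in cand)
    return hits / len(truths)


def accuracy(predictions: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of queries whose predicted label equals the ground truth."""
    correct = sum(1 for pred, truth in zip(predictions, truths) if pred == truth)
    return correct / len(truths)


def drop_truth(candidates: Sequence[Sequence[str]], truths: Sequence[str]) -> List[List[str]]:
    """Ablation: remove the ground-truth label from every candidate list."""
    return [[c for c in cand if c != truth] for cand, truth in zip(candidates, truths)]


def pick_first(candidates: Sequence[Sequence[str]]) -> List[str]:
    """Stand-in for a closed-set ranker: always returns the first candidate."""
    return [cand[0] for cand in candidates]


# Toy data: the third query's label "z" was never retrieved.
candidates = [["a", "b"], ["c", "d"], ["e", "f"]]
truths = ["a", "d", "z"]

recall = recall_at_k(candidates, truths)
acc = accuracy(pick_first(candidates), truths)
ablated_acc = accuracy(pick_first(drop_truth(candidates, truths)), truths)
```

In this toy run the ranker's accuracy stays at or below the retriever's recall, and the ablated accuracy is zero. A nonzero ablated accuracy would only be possible if the MLLM were allowed to answer outside the candidate list.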
Original abstract
CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RAR, a Retrieving And Ranking augmentation for MLLMs in visual recognition. It builds an external memory of category prototypes using a CLIP multi-modal retriever, retrieves top-k candidates at inference time, and has the MLLM rank them to produce the final label. The central claim is that this combination overcomes MLLM context limits on large/fine-grained vocabularies while retaining broad knowledge, yielding significant accuracy gains on 5 fine-grained benchmarks, 11 few-shot datasets, and 2 zero-shot object detection datasets.
Significance. If the reported gains are reproducible and the retriever recall is shown to be high, the approach would provide a lightweight way to scale MLLM recognition to fine-grained tasks without retraining or expanding context windows, addressing a practical limitation in current vision-language systems.
major comments (3)
- [Abstract / Results] Abstract and Results section: the performance claims for 'extensive and fine-grained vocabularies' rest on the unverified assumption that the CLIP retriever places the ground-truth category inside the top-k for the evaluated datasets; no recall@k tables, no ablation that removes the correct class from the retrieved set, and no failure-case analysis on the largest-vocabulary tasks are supplied, rendering the support for the accuracy improvements unassessable.
- [Method] Method section: the construction of the external memory (how prototypes are computed and stored) and the exact MLLM ranking prompt are described at a high level only; without these details it is impossible to determine whether the reported gains depend on dataset-specific tuning or generalize.
- [Experiments] Experiments section: the text supplies no baseline comparisons, no statistical significance tests, no ablation on the choice of k, and no breakdown by dataset size or vocabulary cardinality, all of which are load-bearing for the cross-benchmark claims.
minor comments (1)
- [Abstract] The abstract lists the number of benchmarks but does not name them or cite the original papers; adding these references would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript without altering its core claims.
- Referee: [Abstract / Results] Abstract and Results section: the performance claims for 'extensive and fine-grained vocabularies' rest on the unverified assumption that the CLIP retriever places the ground-truth category inside the top-k for the evaluated datasets; no recall@k tables, no ablation that removes the correct class from the retrieved set, and no failure-case analysis on the largest-vocabulary tasks are supplied, rendering the support for the accuracy improvements unassessable.
Authors: We agree that direct verification of retriever recall would make the claims more robust. While the reported accuracy gains across benchmarks imply that the ground truth is frequently retrieved within the top-k, we will add recall@k tables, an ablation that excludes the ground truth from the candidate pool, and failure-case analysis on the largest-vocabulary tasks in the revised version. revision: yes
- Referee: [Method] Method section: the construction of the external memory (how prototypes are computed and stored) and the exact MLLM ranking prompt are described at a high level only; without these details it is impossible to determine whether the reported gains depend on dataset-specific tuning or generalize.
Authors: We will expand the method section to include the precise computation of category prototypes (CLIP image and text embeddings averaged per class and stored in the memory bank) and the verbatim MLLM ranking prompt template. These additions will demonstrate that the approach relies on standard CLIP and MLLM components without per-dataset hyperparameter tuning. revision: yes
- Referee: [Experiments] Experiments section: the text supplies no baseline comparisons, no statistical significance tests, no ablation on the choice of k, and no breakdown by dataset size or vocabulary cardinality, all of which are load-bearing for the cross-benchmark claims.
Authors: The experiments already include direct comparisons to MLLM-only and CLIP baselines on all reported datasets. To further strengthen the presentation, we will add statistical significance tests (e.g., McNemar or paired t-tests), an ablation study varying k, and performance tables stratified by vocabulary size and number of shots. revision: partial
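The McNemar test the authors mention compares two classifiers on the same test items by looking only at the items where exactly one of them is correct. A minimal sketch of the exact (binomial) two-sided form is shown below, assuming paired per-item predictions from, say, a baseline and RAR. The function name and toy data are illustrative and not taken from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    truths: Sequence[str],
    preds_a: Sequence[str],
    preds_b: Sequence[str],
) -> float:
    """Two-sided exact McNemar p-value for paired predictions on the same items."""
    # b: items only system A gets right; c: items only system B gets right.
    b = sum(1 for t, pa, pb in zip(truths, preds_a, preds_b) if pa == t and pb != t)
    c = sum(1 for t, pa, pb in zip(truths, preds_a, preds_b) if pa != t and pb == t)
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    # Under the null hypothesis, each discordant item favors A or B with p = 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy data: B is correct on six items that A misses; A never beats B.
truths = ["x"] * 10
preds_a = ["x"] * 4 + ["y"] * 6
preds_b = ["x"] * 10
p_value = mcnemar_exact(truths, preds_a, preds_b)
```

Items where both systems are right, or both wrong, do not affect the statistic, which is why the test suits comparisons where the two methods share most of their predictions.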
Circularity Check
No circularity: empirical retrieval-ranking method with no self-referential derivations
The paper proposes an empirical pipeline (CLIP-based multi-modal retriever populates external memory; top-k items are ranked by MLLM at inference) whose performance claims rest entirely on reported benchmark numbers across fine-grained, few-shot, and detection tasks. No equations, parameters, or uniqueness theorems are defined in terms of the target outputs; the method description contains no fitted-input-called-prediction, self-definitional, or self-citation-load-bearing steps. The central claim is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- top-k
axioms (1)
- domain assumption: CLIP embeddings enable effective retrieval of relevant category examples for fine-grained visual tasks.
Forward citations
Cited by 2 Pith papers
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.