
arXiv: 2403.13805 · v2 · pith:JAYXG4FFnew · submitted 2024-03-20 · 💻 cs.CV · cs.AI · cs.LG

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Pith reviewed 2026-05-24 03:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords fine-grained visual recognition · few-shot image recognition · zero-shot object detection · multimodal large language models · image retrieval · ranking · external memory augmentation

The pith

CLIP-based retrieval of top-k candidates followed by MLLM ranking improves accuracy on fine-grained and few-shot visual tasks with large category sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RAR, a method that builds an external memory of categories using a CLIP retriever and then has the multimodal large language model rank the top-k retrieved items to produce the final prediction. This setup is intended to overcome the drop in MLLM performance that occurs when the number of categories grows large enough to strain context windows and increase decision complexity, while keeping the broad knowledge the models acquired during pre-training. The authors show the combination yields higher accuracy than either component alone on fine-grained benchmarks, few-shot datasets, and zero-shot object detection. A reader would care because the approach offers a practical route to scaling precise visual recognition to vocabularies that exceed what a single forward pass through an MLLM can reliably handle.

Core claim

RAR creates explicit category memory outside the context window with a CLIP multi-modal retriever, retrieves the top-k similar items at inference time, and lets the MLLM rank those candidates to reach a final prediction; this produces significant gains on five fine-grained visual recognition benchmarks, eleven few-shot image recognition datasets, and two object detection datasets under zero-shot recognition.

What carries the argument

CLIP multi-modal retriever that stores and queries external category memory, followed by MLLM ranking of the retrieved top-k items to select the output label.
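
The retrieve-then-rank flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `memory` is a plain dict of precomputed embeddings standing in for the CLIP-built store, `rank_fn` is a hypothetical callable standing in for the MLLM (which would also receive the image), and the prompt wording is modeled on the ranking prompt quoted in the paper's appendix.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Vector, memory: Dict[str, Vector], k: int) -> List[str]:
    """Return the k category names whose stored embeddings best match the query."""
    ranked = sorted(memory, key=lambda name: cosine(query, memory[name]), reverse=True)
    return ranked[:k]


def build_ranking_prompt(candidates: List[str]) -> str:
    """Format the retrieved candidates into a ranking prompt (illustrative wording)."""
    return (
        "Please play the role of a classification expert, and sort the provided "
        f"categories from high to low according to the top {len(candidates)} similarity "
        f"with the input image. Here are the optional categories: {candidates}"
    )


def rar_predict(
    query: Vector,
    memory: Dict[str, Vector],
    k: int,
    rank_fn: Callable[[str, List[str]], List[str]],
) -> str:
    """Retrieve top-k candidates, let rank_fn (hypothetical MLLM stand-in) reorder them."""
    candidates = retrieve_top_k(query, memory, k)
    ranked = rank_fn(build_ranking_prompt(candidates), candidates)
    # Assumption: fall back to the retriever's best guess if the ranker
    # returns nothing or names a label outside the candidate list.
    if not ranked or ranked[0] not in candidates:
        return candidates[0]
    return ranked[0]
```

The fallback to the retriever's top candidate is a design choice of this sketch; it keeps the output inside the retrieved set, which is what makes retriever recall an upper limit on accuracy.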

If this is right

  • Accuracy rises on the five reported fine-grained visual recognition benchmarks.
  • Performance improves across the eleven few-shot image recognition datasets.
  • Zero-shot object detection accuracy increases on the two evaluated datasets.
  • The method maintains the MLLM's broad pre-trained knowledge while mitigating its fine-grained limitations at scale.
  • The external-memory design allows handling category vocabularies larger than the model's context window permits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same retrieve-then-rank pattern could be tested on visual question answering or captioning tasks that also face large output spaces.
  • Replacing the CLIP retriever with a stronger or task-specific one would be a direct way to reduce failure cases where the correct label is missed in the top-k.
  • The separation of long-term memory from the language model's context window suggests a reusable template for other classification problems that outgrow single-pass context limits.

Load-bearing premise

The correct category will appear among the top-k items returned by the CLIP retriever so that the subsequent ranking step can still produce an accurate prediction even when the total number of categories is large.
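
The premise can be written as a bound. As a sketch, assume the ranker always outputs one of the retrieved candidates; the symbols $x$ (input image), $y$ (true label), $T_k(x)$ (retrieved top-k set), and $\hat{y}$ (final prediction) are notation introduced here, not the paper's.

```latex
\Pr[\hat{y} = y]
  = \Pr[y \in T_k(x)] \cdot \Pr[\hat{y} = y \mid y \in T_k(x)]
  \le \Pr[y \in T_k(x)]
  = \mathrm{Recall@}k .
```

Under that assumption, end-to-end accuracy cannot exceed the retriever's recall@k, so the premise matters most on the largest vocabularies, where recall is hardest to keep high.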

What would settle it

Measure accuracy on a test set where the ground-truth label is forced to be absent from the CLIP top-k retrieval results; if accuracy falls to near-chance levels the central claim is falsified.
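
The test described above can be sketched as follows. This is an illustration under assumptions, not a procedure from the paper: `scores` is a hypothetical dict of retriever similarity scores per category, and `rank_fn` is a hypothetical stand-in for the MLLM that takes the candidate list and returns labels in ranked order.

```python
from typing import Callable, Dict, List, Tuple


def top_k_excluding(scores: Dict[str, float], k: int, exclude: str) -> List[str]:
    """Top-k category names by score, with one category (the ground truth) forced out."""
    kept = [name for name in scores if name != exclude]
    kept.sort(key=lambda name: scores[name], reverse=True)
    return kept[:k]


def ablation_accuracy(
    samples: List[Tuple[Dict[str, float], str]],
    k: int,
    rank_fn: Callable[[List[str]], List[str]],
) -> float:
    """Accuracy when the true label is withheld from every candidate list."""
    correct = 0
    for scores, label in samples:
        candidates = top_k_excluding(scores, k, exclude=label)
        prediction = rank_fn(candidates)[0]
        correct += prediction == label
    return correct / len(samples)
```

If the ranker is constrained to return one of the candidates, this accuracy is zero by construction, so the measurement is informative mainly when the MLLM is free to answer outside the candidate list.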

Figures

Figures reproduced from arXiv: 2403.13805 by Dahua Lin, Jiaqi Wang, Pan Zhang, Wei Li, Xiaoyi Dong, Yuanjun Xiong, Yuhang Zang, Zeyi Sun, Ziyu Liu.

Figure 1. Upper left: our motivation about the drawbacks of CLIP and MLLM. Our RAR can seamlessly integrate into MLLMs to improve the few-shot/zero-shot abilities on classification (upper right) and detection (bottom) datasets.
Figure 2. Pipeline of RAR. (a) We design a multimodal retriever that extracts the image or text embeddings and stores embeddings in an external memory M. (b) For the inference stage of downstream recognition tasks, we retrieve top-k categories from the memory and use MLLMs to refine the retrieved results as the final prediction through ranking.
Figure 3. Extending our multimodal retriever to zero-shot recognition on object detection datasets such as LVIS [14] and V3Det [48]. Compared to the classification datasets, we apply the additional pre-processing techniques such as cropping and resizing to extract the image embeddings.
Figure 4. Ranking Prompt examples for few-shot image classification. The fine-grained image examples are from Stanford Cars [20]. We incorporate the initial top-k retrieved results (e.g., k = 5) into our ranking prompts and use the MLLMs to rank the retrieved results and make the final prediction.
Figure 5. Visualization of the ranking examples for zero-shot object recognition on LVIS [14] validation set. Given the top retrieved predictions, our RAR uses MLLMs to select the correct class names accurately.
Figure 6. Datasets used in our experiments. We select 14 classification datasets (7 fine-grained and 7 common) and 2 object detection datasets as our benchmarks.
Figure 7. GPT4V Example for Stanford Cars and FGVC Aircraft. Green for ground truth, blue for characteristics analyzed by GPT-4V.
Figure 8. GPT4V Example for Flowers102, Pets37 and Food101. Green for ground truth, blue for characteristics analyzed by GPT-4V.
Figure 9. Evaluation on CLIP+KNN for Caltech101, Flowers102, RAF-DB, Pets37, DTD and UCF101. We report the top-1, 5, 10, 15, 20 accuracy (%) under the 4-shot settings.
Figure 10. Evaluation on MLLMs for Caltech101, Flowers102. We report the test results using 10, 15, 20, 25, and 30 category names as inputs.
Figure 11. Metric curve visualization of CLIP [41] zero-shot classification on LVIS [14] with ground truth proposals. Different behaviors can be seen before and after blurring with respect to different object’s scales.
Original abstract

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and the 2 object detection datasets under the zero-shot recognition setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes RAR, a Retrieving And Ranking augmentation for MLLMs in visual recognition. It builds an external memory of category prototypes using a CLIP multi-modal retriever, retrieves top-k candidates at inference time, and has the MLLM rank them to produce the final label. The central claim is that this combination overcomes MLLM context limits on large/fine-grained vocabularies while retaining broad knowledge, yielding significant accuracy gains on 5 fine-grained benchmarks, 11 few-shot datasets, and 2 zero-shot object detection datasets.

Significance. If the reported gains are reproducible and the retriever recall is shown to be high, the approach would provide a lightweight way to scale MLLM recognition to fine-grained tasks without retraining or expanding context windows, addressing a practical limitation in current vision-language systems.

major comments (3)
  1. [Abstract / Results] Abstract and Results section: the performance claims for 'extensive and fine-grained vocabularies' rest on the unverified assumption that the CLIP retriever places the ground-truth category inside the top-k for the evaluated datasets; no recall@k tables, no ablation that removes the correct class from the retrieved set, and no failure-case analysis on the largest-vocabulary tasks are supplied, rendering the support for the accuracy improvements unassessable.
  2. [Method] Method section: the construction of the external memory (how prototypes are computed and stored) and the exact MLLM ranking prompt are described at a high level only; without these details it is impossible to determine whether the reported gains depend on dataset-specific tuning or generalize.
  3. [Experiments] Experiments section: the text supplies no baseline comparisons, no statistical significance tests, no ablation on the choice of k, and no breakdown by dataset size or vocabulary cardinality, all of which are load-bearing for the cross-benchmark claims.
minor comments (1)
  1. [Abstract] The abstract lists the number of benchmarks but does not name them or cite the original papers; adding these references would improve clarity.
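
A recall@k table of the kind the referee asks for could be computed along these lines. This is a minimal sketch: `ranked_lists` (the retriever's ordered candidate names per test image) and `labels` are hypothetical inputs, and the values of k are arbitrary.

```python
from typing import Dict, List, Sequence


def recall_at_k(ranked_lists: Sequence[List[str]], labels: Sequence[str], k: int) -> float:
    """Fraction of samples whose true label appears among the first k retrieved names."""
    hits = sum(label in ranked[:k] for ranked, label in zip(ranked_lists, labels))
    return hits / len(labels)


def recall_table(
    ranked_lists: Sequence[List[str]], labels: Sequence[str], ks: Sequence[int]
) -> Dict[int, float]:
    """Recall@k for each requested k, e.g. to sweep the top-k hyperparameter."""
    return {k: recall_at_k(ranked_lists, labels, k) for k in ks}
```

Reporting this table per dataset, alongside final accuracy, would show how much of the remaining error comes from retrieval misses rather than from the ranking step.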

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results section: the performance claims for 'extensive and fine-grained vocabularies' rest on the unverified assumption that the CLIP retriever places the ground-truth category inside the top-k for the evaluated datasets; no recall@k tables, no ablation that removes the correct class from the retrieved set, and no failure-case analysis on the largest-vocabulary tasks are supplied, rendering the support for the accuracy improvements unassessable.

    Authors: We agree that direct verification of retriever recall would make the claims more robust. While the reported accuracy gains across benchmarks imply that the ground-truth is frequently retrieved within top-k, we will add recall@k tables, an ablation that excludes the ground-truth from the candidate pool, and failure-case analysis on the largest-vocabulary tasks in the revised version. revision: yes

  2. Referee: [Method] Method section: the construction of the external memory (how prototypes are computed and stored) and the exact MLLM ranking prompt are described at a high level only; without these details it is impossible to determine whether the reported gains depend on dataset-specific tuning or generalize.

    Authors: We will expand the method section to include the precise computation of category prototypes (CLIP image and text embeddings averaged per class and stored in the memory bank) and the verbatim MLLM ranking prompt template. These additions will demonstrate that the approach relies on standard CLIP and MLLM components without per-dataset hyperparameter tuning. revision: yes

  3. Referee: [Experiments] Experiments section: the text supplies no baseline comparisons, no statistical significance tests, no ablation on the choice of k, and no breakdown by dataset size or vocabulary cardinality, all of which are load-bearing for the cross-benchmark claims.

    Authors: The experiments already include direct comparisons to MLLM-only and CLIP baselines on all reported datasets. To further strengthen the presentation, we will add statistical significance tests (e.g., McNemar or paired t-tests), an ablation study varying k, and performance tables stratified by vocabulary size and number of shots. revision: partial
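
The McNemar test mentioned in the last response compares two classifiers on the same test items using only the items where they disagree. A minimal sketch of the exact (binomial) two-sided form, assuming hypothetical per-sample correctness lists for a baseline and for RAR:

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline_correct: Sequence[bool], method_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from paired per-sample correctness flags."""
    pairs = list(zip(baseline_correct, method_correct))
    b = sum(base and not meth for base, meth in pairs)  # baseline right, method wrong
    c = sum(meth and not base for base, meth in pairs)  # method right, baseline wrong
    n = b + c
    if n == 0:
        return 1.0  # the two systems never disagree
    # Under the null hypothesis, each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The exact form is used here so the sketch needs only the standard library; with many discordant pairs, a chi-square approximation is a common alternative.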

Circularity Check

0 steps flagged

No circularity: empirical retrieval-ranking method with no self-referential derivations

full rationale

The paper proposes an empirical pipeline (CLIP-based multi-modal retriever populates external memory; top-k items are ranked by MLLM at inference) whose performance claims rest entirely on reported benchmark numbers across fine-grained, few-shot, and detection tasks. No equations, parameters, or uniqueness theorems are defined in terms of the target outputs; the method description contains no fitted-input-called-prediction, self-definitional, or self-citation-load-bearing steps. The central claim is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that CLIP retrieval supplies useful candidates and on the hyperparameter choice of retrieval count; no new entities are postulated.

free parameters (1)
  • top-k
    Number of retrieved candidates passed to the MLLM for ranking; chosen as a hyperparameter without a reported fitting procedure.
axioms (1)
  • domain assumption: CLIP embeddings enable effective retrieval of relevant category examples for fine-grained visual tasks.
    Invoked to justify construction of the multi-modal retriever and external memory.
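
One way such an external memory could be populated is sketched below. It assumes per-class averaging of embeddings, as the simulated rebuttal describes; the paper may instead store individual embeddings, and `build_memory` and its input format are hypothetical names introduced for illustration. The embeddings would come from a CLIP encoder, which is outside this sketch.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_memory(embeddings: Iterable[Tuple[str, Sequence[float]]]) -> Dict[str, List[float]]:
    """Average the normalized embeddings of each class into one unit-length prototype."""
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for label, vec in embeddings:
        grouped[label].append(l2_normalize(vec))
    memory: Dict[str, List[float]] = {}
    for label, vecs in grouped.items():
        mean = [sum(column) / len(vecs) for column in zip(*vecs)]
        memory[label] = l2_normalize(mean)
    return memory
```

Normalizing before and after averaging keeps every prototype on the unit sphere, so a dot product against a normalized query equals cosine similarity.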

pith-pipeline@v0.9.0 · 5814 in / 1266 out tokens · 36277 ms · 2026-05-24T03:29:14.125046+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV · 2024-10 · accept · novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  2. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV · 2024-07 · conditional · novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 2 Pith papers · 9 internal anchors

  1. [1]

    Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., Jitsev, J., Kornblith, S., Koh, P.W., Ilharco, G., Wortsman, M., Schmidt, L.: Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv.org (2023)

  2. [2]

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv.org (2023)

  3. [3]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023)

  4. [4]

    Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: CVPR (2014)

  5. [5]

    Conti, A., Fini, E., Mancini, M., Rota, P., Wang, Y., Ricci, E.: Vocabulary-free image classification. In: NeurIPS (2024)

  6. [6]

    Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)

  7. [7]

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)

  8. [8]

    Dong, X., Bao, J., Zheng, Y., Zhang, T., Chen, D., Yang, H., Zeng, M., Zhang, W., Yuan, L., Chen, D., Wen, F., Yu, N.: Maskclip: Masked self-distillation advances contrastive language-image pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10995–11005 (June 2023)

  9. [9]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Dong, X., Zhang, P., Zang, Y., Cao, Y., Wang, B., Ouyang, L., Wei, X., Zhang, S., Duan, H., Cao, M., et al.: InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024)

  10. [10]

    Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., Cao, Y.: EVA: Exploring the limits of masked visual representation learning at scale. In: CVPR (2023)

  11. [11]

    Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: CVPR workshop (2004)

  12. [12]

    Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-Adapter: Better vision-language models with feature adapters. IJCV (2023)

  13. [13]

    Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. In: ICLR (2022)

  14. [14]

    Gupta, A., Dollar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: CVPR (2019)

  15. [15]

    Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (2019)

  16. [16]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

  17. [17]

    Iscen, A., Fathi, A., Schmid, C.: Improving image recognition by retrieving from web-scale image-text data. In: CVPR (2023)

  18. [18]

    Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)

  19. [19]

    Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.F.: Novel dataset for fine-grained image categorization: Stanford dogs. In: CVPR workshop (2011)

  20. [20]

    Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: ICCV workshops (2013)

  21. [21]

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-Augmented generation for knowledge-intensive nlp tasks. NeurIPS (2020)

  22. [22]

    Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)

  23. [23]

    Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022)

  24. [24]

    Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: CVPR (2022)

  25. [25]

    Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: CVPR (2017)

  26. [26]

    Li, Y., Fan, H., Hu, R., Feichtenhofer, C., He, K.: Scaling language-image pre-training via masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 23390–23400 (June 2023)

  27. [27]

    Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: CVPR (2023)

  28. [28]

    Improved Baselines with Visual Instruction Tuning

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)

  29. [29]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2024)

  30. [30]

    Liu, H., Son, K., Yang, J., Liu, C., Gao, J., Lee, Y.J., Li, C.: Learning customized visual models with retrieval-augmented knowledge. In: CVPR (2023)

  31. [31]

    Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., Ricci, E.: Democratizing fine-grained visual recognition with large language models. In: ICLR (2024)

  32. [32]

    Long, A., Yin, W., Ajanthan, T., Nguyen, V., Purkait, P., Garg, R., Blair, A., Shen, C., van den Hengel, A.: Retrieval augmented classification for long-tail visual recognition. In: CVPR (2022)

  33. [33]

    Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7086–7096 (June 2022)

  34. [34]

    Fine-Grained Visual Classification of Aircraft

    Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)

  35. [35]

    Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. TPAMI (2018)

  36. [36]

    Miller, G.A.: WordNet: a lexical database for english. Communications of the ACM (1995)

  37. [37]

    Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008)

  38. [38]

    OpenAI: GPT-4V(ision) system card (2023), https://openai.com/research/gpt-4v-system-card

  39. [39]

    Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: CVPR (2012)

  40. [40]

    Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv.org (2023)

  41. [41]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  42. [42]

    Shtedritski, A., Rupprecht, C., Vedaldi, A.: What does clip know about a red circle? visual prompt engineering for vlms. arXiv preprint arXiv:2304.06712 (2023)

  43. [43]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)

  44. [44]

    Subramanian, S., Merrill, W., Darrell, T., Gardner, M., Singh, S., Rohrbach, A.: Reclip: A strong zero-shot baseline for referring expression comprehension. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5198–5215 (2022)

  45. [45]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)

  46. [46]

    Sun, Z., Fang, Y., Wu, T., Zhang, P., Zang, Y., Kong, S., Xiong, Y., Lin, D., Wang, J.: Alpha-CLIP: A clip model focusing on wherever you want. arXiv preprint arXiv:2312.03818 (2023)

  47. [47]

    Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: Caltech-ucsd birds-200-2011 (2011)

  48. [48]

    Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D.: V3Det: Vast vocabulary visual detection dataset. In: ICCV (2023)

  49. [49]

    Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M., Tang, J.: Cogvlm: Visual expert for pretrained language models (2023)

  50. [50]

    Wu, W., Yao, H., Zhang, M., Song, Y., Ouyang, W., Wang, J.: GPT4Vis: What can gpt-4 do for zero-shot visual recognition? arXiv preprint arXiv:2311.15732 (2023)

  51. [51]

    Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: CVPR (2010)

  52. [52]

    Xu, X., Xiong, T., Ding, Z., Tu, Z.: Masqclip for open-vocabulary universal image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 887–898 (2023)

  53. [53]

    Yang, H., Ma, C., Wen, B., Jiang, Y., Yuan, Z., Zhu, X.: Recognize any regions. arXiv preprint arXiv:2311.01373 (2023)

  54. [54]

    Yang, L., Wang, Y., Li, X., Wang, X., Yang, J.: Fine-grained visual prompting (2023)

  55. [55]

    Yasunaga, M., Aghajanyan, A., Shi, W., James, R., Leskovec, J., Liang, P., Lewis, M., Zettlemoyer, L., Yih, W.t.: Retrieval-augmented multimodal language modeling. In: ICML (2023)

  56. [56]

    Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv.org (2023)

  57. [57]

    Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Open-vocabulary detr with conditional matching. In: ECCV (2022)

  58. [58]

    InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

    Zhang, P., Wang, X.D.B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Ding, S., Zhang, S., Duan, H., Yan, H., et al.: Internlm-Xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)

  59. [59]

    Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: RegionCLIP: Region-based language-image pretraining. In: CVPR (2022)

  60. [60]

    Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. IJCV (2022)

  61. [61]

    Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: ECCV (2022)

  62. [62]

    Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
