RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Dahua Lin; Jiaqi Wang; Pan Zhang; Wei Li; Xiaoyi Dong; Yuanjun Xiong; Yuhang Zang; Zeyi Sun; Ziyu Liu

arXiv: 2403.13805 · v2 · pith:JAYXG4FFnew · submitted 2024-03-20 · 💻 cs.CV · cs.AI · cs.LG

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Pith reviewed 2026-05-24 03:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords: fine-grained visual recognition · few-shot image recognition · zero-shot object detection · multimodal large language models · image retrieval · ranking · external memory augmentation

The pith

CLIP-based retrieval of top-k candidates followed by MLLM ranking improves accuracy on fine-grained and few-shot visual tasks with large category sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RAR, a method that builds an external memory of categories using a CLIP retriever and then has the multimodal large language model rank the top-k retrieved items to produce the final prediction. This setup is intended to overcome the drop in MLLM performance that occurs when the number of categories grows large enough to strain context windows and increase decision complexity, while keeping the broad knowledge the models acquired during pre-training. The authors show the combination yields higher accuracy than either component alone on fine-grained benchmarks, few-shot datasets, and zero-shot object detection. A reader would care because the approach offers a practical route to scaling precise visual recognition to vocabularies that exceed what a single forward pass through an MLLM can reliably handle.

Core claim

RAR creates explicit category memory outside the context window with a CLIP multi-modal retriever, retrieves the top-k similar items at inference time, and lets the MLLM rank those candidates to reach a final prediction; this produces significant gains on five fine-grained visual recognition benchmarks, eleven few-shot image recognition datasets, and two object detection datasets under zero-shot recognition.

What carries the argument

CLIP multi-modal retriever that stores and queries external category memory, followed by MLLM ranking of the retrieved top-k items to select the output label.
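
The retrieve-then-rank flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `cosine`, `retrieve_top_k`, `rank_with_mllm`, and `predict` are hypothetical, the three-dimensional embeddings are made-up stand-ins for CLIP features, and the MLLM call is replaced by a stub that keeps the retrieval order.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Vector, memory: Dict[str, Vector], k: int) -> List[str]:
    """Return the k category names whose stored embedding is most similar to the query."""
    ranked = sorted(memory, key=lambda name: cosine(query, memory[name]), reverse=True)
    return ranked[:k]


def rank_with_mllm(candidates: List[str]) -> List[str]:
    """Hypothetical stand-in for the MLLM ranking step: keeps the retrieval order."""
    return list(candidates)


def predict(
    query: Vector,
    memory: Dict[str, Vector],
    k: int,
    ranker: Callable[[List[str]], List[str]] = rank_with_mllm,
) -> str:
    """Retrieve a shortlist, let the ranker reorder it, and return the top label."""
    shortlist = retrieve_top_k(query, memory, k)
    return ranker(shortlist)[0]


# Toy external memory: category name -> embedding (values are invented).
memory = {
    "sedan": [0.9, 0.1, 0.0],
    "coupe": [0.8, 0.3, 0.1],
    "pickup": [0.1, 0.9, 0.2],
    "van": [0.0, 0.7, 0.7],
}
query = [1.0, 0.1, 0.0]

shortlist = retrieve_top_k(query, memory, k=2)
prediction = predict(query, memory, k=2)
```

The ranker only ever sees `k` names, so the prompt size stays fixed no matter how many categories the memory holds. Swapping in a different `ranker` callable changes the final label without touching retrieval.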

If this is right

Accuracy rises on the five reported fine-grained visual recognition benchmarks.
Performance improves across the eleven few-shot image recognition datasets.
Zero-shot object detection accuracy increases on the two evaluated datasets.
The method maintains the MLLM's broad pre-trained knowledge while mitigating its fine-grained limitations at scale.
The external-memory design allows handling category vocabularies larger than the model's context window permits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieve-then-rank pattern could be tested on visual question answering or captioning tasks that also face large output spaces.
Replacing the CLIP retriever with a stronger or task-specific one would be a direct way to reduce failure cases where the correct label is missed in the top-k.
The separation of long-term memory from the language model's context window suggests a reusable template for other classification problems that outgrow single-pass context limits.

Load-bearing premise

The correct category will appear among the top-k items returned by the CLIP retriever so that the subsequent ranking step can still produce an accurate prediction even when the total number of categories is large.
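
This premise is directly measurable as recall@k. The sketch below assumes a hypothetical helper `recall_at_k` and invented labels; it is meant only to show the quantity in question. When the ranker is restricted to the shortlist, end-to-end top-1 accuracy cannot exceed this value.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> float:
    """Fraction of queries whose ground-truth label is among the first k retrieved labels."""
    if len(retrieved) != len(truths):
        raise ValueError("retrieved and truths must have the same length")
    if not truths:
        raise ValueError("at least one query is required")
    hits = sum(1 for cands, truth in zip(retrieved, truths) if truth in cands[:k])
    return hits / len(truths)


# Hypothetical ranked retrieval output for four queries (labels are made up).
retrieved = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["c", "d", "a"],
    ["d", "c", "b"],
]
truths = ["a", "a", "a", "a"]

r1 = recall_at_k(retrieved, truths, k=1)
r2 = recall_at_k(retrieved, truths, k=2)
r3 = recall_at_k(retrieved, truths, k=3)
```

Reporting this number per dataset and per k would show how much headroom the ranking stage has on each benchmark.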

What would settle it

Measure accuracy on a test set where the ground-truth label is forced to be absent from the CLIP top-k retrieval results; if accuracy falls to near-chance levels the central claim is falsified.
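
One way to set up such a measurement is sketched below. The helpers `ablate_ground_truth`, `accuracy`, and `pick_first` are hypothetical, the labels are invented, and `pick_first` is a trivial stand-in for the MLLM ranker.

```python
from typing import Callable, List, Sequence


def ablate_ground_truth(ranked: Sequence[str], truth: str, k: int) -> List[str]:
    """Top-k shortlist with the ground-truth label removed and backfilled from lower ranks."""
    return [label for label in ranked if label != truth][:k]


def accuracy(
    shortlists: Sequence[Sequence[str]],
    truths: Sequence[str],
    ranker: Callable[[Sequence[str]], str],
) -> float:
    """Share of queries for which the ranker's chosen label equals the ground truth."""
    correct = sum(1 for s, t in zip(shortlists, truths) if ranker(s) == t)
    return correct / len(truths)


def pick_first(shortlist: Sequence[str]) -> str:
    """Trivial stand-in ranker: always returns the first candidate."""
    return shortlist[0]


# Hypothetical full retrieval rankings for three queries.
ranked_lists = [
    ["cat", "lynx", "dog", "fox"],
    ["dog", "fox", "cat", "lynx"],
    ["lynx", "fox", "cat", "dog"],
]
truths = ["cat", "dog", "fox"]
k = 2

normal = [list(r[:k]) for r in ranked_lists]
ablated = [ablate_ground_truth(r, t, k) for r, t in zip(ranked_lists, truths)]

acc_normal = accuracy(normal, truths, pick_first)
acc_ablated = accuracy(ablated, truths, pick_first)
```

Any ranker that must choose from the shortlist scores zero on the ablated set by construction, so the comparison is informative mainly for rankers that are allowed to answer outside the candidates they are given.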

Figures

Figures reproduced from arXiv: 2403.13805 by Dahua Lin, Jiaqi Wang, Pan Zhang, Wei Li, Xiaoyi Dong, Yuanjun Xiong, Yuhang Zang, Zeyi Sun, Ziyu Liu.

**Figure 1.** Upper left: our motivation about the drawbacks of CLIP and MLLM. Our RAR can seamlessly integrate into MLLMs to improve the few-shot/zero-shot abilities on classification (upper right) and detection (bottom) datasets.

**Figure 2.** Pipeline of RAR. (a) We design a multimodal retriever that extracts the image or text embeddings and stores embeddings in an external memory M. (b) For the inference stage of downstream recognition tasks, we retrieve top-k categories from the memory and use MLLMs to refine the retrieved results as the final prediction through ranking.

**Figure 3.** Extending our multimodal retriever to zero-shot recognition on object detection datasets such as LVIS [14] and V3Det [48]. Compared to the classification datasets, we apply the additional pre-processing techniques such as cropping and resizing to extract the image embeddings.

**Figure 4.** Ranking prompt examples for few-shot image classification. The fine-grained image examples are from Stanford Cars [20]. We incorporate the initial top-k retrieved results (e.g., k = 5) into our ranking prompts and use the MLLMs to rank the retrieved results and make the final prediction.

**Figure 5.** Visualization of the ranking examples for zero-shot object recognition on LVIS [14] validation set. Given the top retrieved predictions, our RAR uses MLLMs to select the correct class names accurately.

**Figure 6.** Datasets used in our experiments. We select 14 classification datasets (7 fine-grained and 7 common) and 2 object detection datasets as our benchmarks.

**Figure 7.** GPT4V Example for Stanford Cars and FGVC Aircraft. Green for ground truth, blue for characteristics analyzed by GPT-4V.

**Figure 8.** GPT4V Example for Flowers102, Pets37 and Food101. Green for ground truth, blue for characteristics analyzed by GPT-4V.

**Figure 9.** Evaluation on CLIP+KNN for Caltech101, Flowers102, RAF-DB, Pets37, DTD and UCF101. We report the top-1, 5, 10, 15, 20 accuracy (%) under the 4-shot settings.

**Figure 10.** Evaluation on MLLMs for Caltech101, Flowers102. We report the test results using 10, 15, 20, 25, and 30 category names as inputs.

**Figure 11.** Metric curve visualization of CLIP [41] zero-shot classification on LVIS [14] with ground truth proposals. Different behaviors can be seen before and after blurring with respect to different object’s scales.

Original abstract

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines with an increase in category numbers, primarily due to growing complexity and constraints of limited context window size. To synergize the strengths of both approaches and enhance the few-shot/zero-shot recognition abilities for datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We initially establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and the 2 object detection datasets under the zero-shot recognition setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RAR's CLIP-retrieve-then-MLLM-rank pipeline targets large-vocabulary fine-grained recognition, but the abstract supplies no recall@k figures or experimental details, so the gains stay unverified.

The main point is a straightforward retrieve-then-rank setup: CLIP builds an external memory of category embeddings, pulls the top-k matches for a query image, and the MLLM only ranks that shortlist to output the label. This keeps the MLLM inside its context window even when the total vocabulary is large. The split is sensible given CLIP's strength at broad similarity and MLLMs' strength at fine distinctions. The paper positions this as a way to handle extensive fine-grained sets without losing the MLLM's pre-trained knowledge, and it claims gains on five fine-grained benchmarks, eleven few-shot datasets, and two detection tasks under zero-shot conditions. If those results hold with proper controls, the approach could be a practical fix for deployment scenarios where category lists exceed prompt size.

The soft spot is the missing retrieval evidence. For the ranking stage to ever see the correct label on large-vocab tasks, the CLIP retriever must place the ground-truth class in its top-k with high reliability; the abstract gives no recall@k numbers, no ablation that removes the correct class from the shortlist, and no failure cases on the biggest vocabularies. Without those, the performance claims rest on an unchecked assumption. The text also lacks baselines, ablations, statistical tests, or implementation details, so the reported improvements cannot be evaluated from what is here.

This is aimed at people already working on retrieval-augmented vision-language systems or scaling MLLMs past context limits. A reader who needs concrete numbers on retrieval quality would get limited value until the full experiments appear. I would send it to peer review if the complete version includes the recall metrics and they look reasonable, because the underlying problem is real and the proposed split is simple enough to test.

Referee Report

3 major / 1 minor

Summary. The paper proposes RAR, a Retrieving And Ranking augmentation for MLLMs in visual recognition. It builds an external memory of category prototypes using a CLIP multi-modal retriever, retrieves top-k candidates at inference time, and has the MLLM rank them to produce the final label. The central claim is that this combination overcomes MLLM context limits on large/fine-grained vocabularies while retaining broad knowledge, yielding significant accuracy gains on 5 fine-grained benchmarks, 11 few-shot datasets, and 2 zero-shot object detection datasets.

Significance. If the reported gains are reproducible and the retriever recall is shown to be high, the approach would provide a lightweight way to scale MLLM recognition to fine-grained tasks without retraining or expanding context windows, addressing a practical limitation in current vision-language systems.

major comments (3)

[Abstract / Results] Abstract and Results section: the performance claims for 'extensive and fine-grained vocabularies' rest on the unverified assumption that the CLIP retriever places the ground-truth category inside the top-k for the evaluated datasets; no recall@k tables, no ablation that removes the correct class from the retrieved set, and no failure-case analysis on the largest-vocabulary tasks are supplied, rendering the support for the accuracy improvements unassessable.
[Method] Method section: the construction of the external memory (how prototypes are computed and stored) and the exact MLLM ranking prompt are described at a high level only; without these details it is impossible to determine whether the reported gains depend on dataset-specific tuning or generalize.
[Experiments] Experiments section: the text supplies no baseline comparisons, no statistical significance tests, no ablation on the choice of k, and no breakdown by dataset size or vocabulary cardinality, all of which are load-bearing for the cross-benchmark claims.

minor comments (1)

[Abstract] The abstract lists the number of benchmarks but does not name them or cite the original papers; adding these references would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract / Results] Abstract and Results section: the performance claims for 'extensive and fine-grained vocabularies' rest on the unverified assumption that the CLIP retriever places the ground-truth category inside the top-k for the evaluated datasets; no recall@k tables, no ablation that removes the correct class from the retrieved set, and no failure-case analysis on the largest-vocabulary tasks are supplied, rendering the support for the accuracy improvements unassessable.

Authors: We agree that direct verification of retriever recall would make the claims more robust. While the reported accuracy gains across benchmarks imply that the ground-truth is frequently retrieved within top-k, we will add recall@k tables, an ablation that excludes the ground-truth from the candidate pool, and failure-case analysis on the largest-vocabulary tasks in the revised version. revision: yes
Referee: [Method] Method section: the construction of the external memory (how prototypes are computed and stored) and the exact MLLM ranking prompt are described at a high level only; without these details it is impossible to determine whether the reported gains depend on dataset-specific tuning or generalize.

Authors: We will expand the method section to include the precise computation of category prototypes (CLIP image and text embeddings averaged per class and stored in the memory bank) and the verbatim MLLM ranking prompt template. These additions will demonstrate that the approach relies on standard CLIP and MLLM components without per-dataset hyperparameter tuning. revision: yes
Referee: [Experiments] Experiments section: the text supplies no baseline comparisons, no statistical significance tests, no ablation on the choice of k, and no breakdown by dataset size or vocabulary cardinality, all of which are load-bearing for the cross-benchmark claims.

Authors: The experiments already include direct comparisons to MLLM-only and CLIP baselines on all reported datasets. To further strengthen the presentation, we will add statistical significance tests (e.g., McNemar or paired t-tests), an ablation study varying k, and performance tables stratified by vocabulary size and number of shots. revision: partial
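
For readers unfamiliar with the paired test named in the last response, the following is a minimal sketch of an exact two-sided McNemar test on per-sample correctness. The function name `mcnemar_exact` and the example correctness vectors are assumptions made for illustration; they are not results from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-sample correctness of two systems."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same samples")
    # Discordant pairs: only samples where the two systems disagree carry information.
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, each discordant pair favors either system with probability 1/2.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented per-sample outcomes for two systems on the same ten test images.
system_a = [True, True, True, True, True, True, True, True, False, False]
system_b = [False, False, False, False, False, False, True, True, True, False]

p_value = mcnemar_exact(system_a, system_b)
```

Here six samples favor the first system and one favors the second, giving a p-value of 0.125. The test uses only the samples on which the systems disagree, which suits comparisons where both methods are evaluated on an identical test set.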

Circularity Check

0 steps flagged

No circularity: empirical retrieval-ranking method with no self-referential derivations

The paper proposes an empirical pipeline (CLIP-based multi-modal retriever populates external memory; top-k items are ranked by MLLM at inference) whose performance claims rest entirely on reported benchmark numbers across fine-grained, few-shot, and detection tasks. No equations, parameters, or uniqueness theorems are defined in terms of the target outputs; the method description contains no fitted-input-called-prediction, self-definitional, or self-citation-load-bearing steps. The central claim is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the domain assumption that CLIP retrieval supplies useful candidates and on the hyperparameter choice of retrieval count; no new entities are postulated.

free parameters (1)

top-k
Number of retrieved candidates passed to the MLLM for ranking; chosen as a hyperparameter without reported fitting procedure.

axioms (1)

Domain assumption: CLIP embeddings enable effective retrieval of relevant category examples for fine-grained visual tasks.
Invoked to justify construction of the multi-modal retriever and external memory.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV 2024-07 conditional novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 2 Pith papers · 9 internal anchors

[1]

arXiv.org (2023) 2, 4

Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., Jitsev, J., Kornblith, S., Koh, P.W., Ilharco, G., Wortsman, M., Schmidt, L.: Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv.org (2023) 2, 4

work page 2023
[2]

arXiv.org (2023) 2, 4, 12

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv.org (2023) 2, 4, 12

work page 2023
[3]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793 (2023) 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

In: CVPR (2014) 9, 19

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: CVPR (2014) 9, 19

work page 2014
[5]

In: NeurIPS (2024) 10

Conti, A., Fini, E., Mancini, M., Rota, P., Wang, Y., Ricci, E.: Vocabulary-free image classification. In: NeurIPS (2024) 10

work page 2024
[6]

Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023) 2, 4

work page 2023
[7]

In: CVPR (2009) 9, 19

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009) 9, 19

work page 2009
[8]

In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)

Dong, X., Bao, J., Zheng, Y., Zhang, T., Chen, D., Yang, H., Zeng, M., Zhang, W., Yuan, L., Chen, D., Wen, F., Yu, N.: Maskclip: Masked self-distillation advances contrastive language-image pretraining. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 10995–11005 (June 2023) 1, 3

work page 2023
[9]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Dong, X., Zhang, P., Zang, Y., Cao, Y., Wang, B., Ouyang, L., Wei, X., Zhang, S., Duan, H., Cao, M., et al.: InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024) 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

In: CVPR (2023) 3

Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., Cao, Y.: EVA: Exploring the limits of masked visual representation learning at scale. In: CVPR (2023) 3

work page 2023
[11]

In: CVPR workshop (2004) 9, 19

Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object cate- gories. In: CVPR workshop (2004) 9, 19

work page 2004
[12]

IJCV (2023) 3

Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip- Adapter: Better vision-language models with feature adapters. IJCV (2023) 3

work page 2023
[13]

In: ICLR (2022) 3 16 Ziyu Liu et al

Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. In: ICLR (2022) 3 16 Ziyu Liu et al

work page 2022
[14]

In: CVPR (2019) 7, 9, 12, 13, 26, 27, 28

Gupta, A., Dollar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: CVPR (2019) 7, 9, 12, 13, 26, 27, 28

work page 2019
[15]

Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (2019) 9, 19

work page 2019
[16]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021) 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

In: CVPR (2023) 4

Iscen, A., Fathi, A., Schmid, C.: Improving image recognition by retrieving from web-scale image-text data. In: CVPR (2023) 4

work page 2023
[18]

In: ICML (2021) 3

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021) 3

work page 2021
[19]

In: CVPR workshop (2011) 9, 19

Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.F.: Novel dataset for fine-grained image categorization: Stanford dogs. In: CVPR workshop (2011) 9, 19

work page 2011
[20]

In: ICCV workshops (2013) 8, 9, 19

Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine- grained categorization. In: ICCV workshops (2013) 8, 9, 19

work page 2013
[21]

NeurIPS (2020) 4

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-Augmented generation for knowledge-intensive nlp tasks. NeurIPS (2020) 4

work page 2020
[22]

In: ICML (2023) 10

Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: ICML (2023) 10

work page 2023
[23]

In: ICML (2022) 3

Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML (2022) 3

work page 2022
[24]

In: CVPR (2022) 3

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: CVPR (2022) 3

work page 2022
[25]

In: CVPR (2017) 9, 19

Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learn- ing for expression recognition in the wild. In: CVPR (2017) 9, 19

work page 2017
[26]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Li, Y., Fan, H., Hu, R., Feichtenhofer, C., He, K.: Scaling language-image pre- training via masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 23390–23400 (June 2023) 1, 3

work page 2023
[27]

In: CVPR (2023) 3

Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: CVPR (2023) 3

work page 2023
[28]

Improved Baselines with Visual Instruction Tuning

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023) 2, 11, 12, 24

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

In: NeurIPS (2024) 2, 4

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2024) 2, 4

work page 2024
[30]

In: CVPR (2023) 4

Liu, H., Son, K., Yang, J., Liu, C., Gao, J., Lee, Y.J., Li, C.: Learning customized visual models with retrieval-augmented knowledge. In: CVPR (2023) 4

work page 2023
[31]

In: ICLR (2024) 4, 9, 10, 22, 23

Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., Ricci, E.: Democratizing fine-grained visual recognition with large language models. In: ICLR (2024) 4, 9, 10, 22, 23

work page 2024
[32]

In: CVPR (2022) 4

Long, A., Yin, W., Ajanthan, T., Nguyen, V., Purkait, P., Garg, R., Blair, A., Shen, C., van den Hengel, A.: Retrieval augmented classification for long-tail visual recognition. In: CVPR (2022) 4

work page 2022
[33]

In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition (CVPR)

Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition (CVPR). pp. 7086–7096 (June 2022) 3, 6 RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition 17

work page 2022
[34]

Fine-Grained Visual Classification of Aircraft

Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013) 9, 19

work page internal anchor Pith review Pith/arXiv arXiv 2013
[35]

TPAMI (2018) 6, 9

Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. TPAMI (2018) 6, 9

work page 2018
[36]

Communications of the ACM (1995) 10

Miller, G.A.: WordNet: a lexical database for english. Communications of the ACM (1995) 10

work page 1995
[37]

In: ICVGIP (2008) 9, 19

Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008) 9, 19

work page 2008
[38]

OpenAI: GPT-4V(ision) system card (2023), https://openai.com/research/ gpt-4v-system-card 2, 4, 5, 25

work page 2023
[39]

In: CVPR (2012) 9, 19

Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: CVPR (2012) 9, 19

work page 2012
[40]

arXiv.org (2023) 2, 4

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv.org (2023) 2, 4

work page 2023
[41]

In: ICML (2021) 1, 3, 4, 11, 26, 27

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021) 1, 3, 4, 11, 26, 27

work page 2021
[42]

arXiv preprint arXiv:2304.06712 (2023) 3

Shtedritski,A.,Rupprecht,C.,Vedaldi,A.:Whatdoesclipknowaboutaredcircle? visual prompt engineering for vlms. arXiv preprint arXiv:2304.06712 (2023) 3

work page arXiv 2023
[43]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012) 19

work page internal anchor Pith review Pith/arXiv arXiv 2012
[44]

In: Pro- ceedings of the 60th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers)

Subramanian, S., Merrill, W., Darrell, T., Gardner, M., Singh, S., Rohrbach, A.: Reclip: A strong zero-shot baseline for referring expression comprehension. In: Pro- ceedings of the 60th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers). pp. 5198–5215 (2022) 3

work page 2022
[45]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training tech- niques for clip at scale. arXiv preprint arXiv:2303.15389 (2023) 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Sun, Z., Fang, Y., Wu, T., Zhang, P., Zang, Y., Kong, S., Xiong, Y., Lin, D., Wang, J.:Alpha-CLIP:A clipmodel focusingon whereveryouwant.arXiv preprint arXiv:2312.03818 (2023) 3

work page arXiv 2023
[47]

Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: Caltech-ucsd birds- 200-2011 (2011) 9, 19

work page 2011
[48]

In: ICCV (2023) 3, 7, 9, 12, 28

Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D.: V3Det: Vast vocabulary visual detection dataset. In: ICCV (2023) 3, 7, 9, 12, 28

work page 2023
[49]

Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M., Tang, J.: Cogvlm: Visual expert for pretrained language models (2023) 2, 4, 12

work page 2023
[50]

Wu, W., Yao, H., Zhang, M., Song, Y., Ouyang, W., Wang, J.: GPT4Vis: What can gpt-4 do for zero-shot visual recognition? arXiv preprint arXiv:2311.15732 (2023) 25

work page arXiv 2023
[51]

In: CVPR (2010) 9, 19

Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large- scale scene recognition from abbey to zoo. In: CVPR (2010) 9, 19

work page 2010
[52]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Xu, X., Xiong, T., Ding, Z., Tu, Z.: Masqclip for open-vocabulary universal im- age segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 887–898 (2023) 3

[53]

Yang, H., Ma, C., Wen, B., Jiang, Y., Yuan, Z., Zhu, X.: Recognize any regions. arXiv preprint arXiv:2311.01373 (2023) 3

[54]

Yang, L., Wang, Y., Li, X., Wang, X., Yang, J.: Fine-grained visual prompting (2023) 3, 6

[55]

Yasunaga, M., Aghajanyan, A., Shi, W., James, R., Leskovec, J., Liang, P., Lewis, M., Zettlemoyer, L., Yih, W.t.: Retrieval-augmented multimodal language modeling. In: ICML (2023) 3, 4

[56]

Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv.org (2023) 2, 4

[57]

Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Open-vocabulary detr with conditional matching. In: ECCV (2022) 3

[58]

Zhang, P., Wang, X.D.B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Ding, S., Zhang, S., Duan, H., Yan, H., et al.: InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023) 2, 4

[59]

Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: RegionCLIP: Region-based language-image pretraining. In: CVPR (2022) 3, 11, 12

[60]

Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. IJCV (2022) 3

[61]

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: ECCV (2022) 3

[62]

Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023) 2, 4

Appendix

In this appendix, we provide a series of detailed supporting materials to aid in a deeper und...

[63]

2012 Aston Martin V8 Vantage Coupe

[64]

2012 Aston Martin Virage Coupe

[65]

2012 Aston Martin V8 Vantage Convertible

[66]

2012 Aston Martin Virage Convertible

[67]

2012 Audi TTS Coupe

Please play the role of a classification expert, and sort the provided categories from high to low according to the top 5 similarity with the input image. Here are the optional categories: ['737-800', '757-300', '767-200', 'A340-300', 'MD-90'] Based on the input image and the provided categories, the sorting from high to low similarity...

[68]

A340-300 The aircraft in the image is a Boeing 757, which is identifiable by its long fuselage, narrow-body, and characteristic nose shape. The -300 series has an extended fuselage compared to other variants, which appears to match the aircraft pictured. The 767-200 and MD-90 have some similarities but differ in specific design elements such as winglets a...

[69]

Columbine The flower in the image appears to be a Japanese anemone, which typically has white petals surrounding a yellow center of stamens and a greenish pistil, matching the one shown. The other flowers listed have different characteristics that do not match as closely. Please play the role of a classification expert, and sort the provided categories fr...

[70]

Beagle The dog in the image is a Chihuahua, identifiable by its small size, prominent ears, and facial features. Miniature Pinschers are somewhat similar in size and ear shape, while Yorkshire Terriers, Havanese, and Beagles have distinct differences in coat, size, and facial structure compared to the Chihuahua. Please play the role of a classification ex...

[71]

coupe” (a two-door car), “long fuselage

Donuts The dish in the image appears to include falafel balls and a side of hummus, which are typically found in Middle Eastern cuisine. The other items, such as beet salad, breakfast burrito, and donuts, do not seem to be present in the image or resemble the food shown. Why did you give this order? Why did you give this order? Why did you give this order...

