Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners

Edson Araujo; Eshika Khandelwal; Hilde Kuehne; Nina Shvetsova; Walid Bousselham; Yunhan Wang

arxiv: 2607.00125 · v1 · pith:LIP7SVXJnew · submitted 2026-06-30 · 💻 cs.CV

Yunhan Wang, Eshika Khandelwal, Edson Araujo, Walid Bousselham, Nina Shvetsova, Hilde Kuehne

Pith reviewed 2026-07-02 19:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords: few-shot classification · multimodal LLMs · pairwise comparison · image classification · decomposition · similarity scoring · training-free adaptation

The pith

Off-the-shelf multimodal LLMs become strong few-shot image classifiers by decomposing the task into pairwise same-class decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that few-shot image classification can be reframed as a series of binary questions posed to an MLLM, where each question asks whether a query image and a support image from a candidate class depict the same thing. The logit attached to an affirmative answer is treated as a similarity score that ranks classes and assigns the query image. Adding domain context to the prompt raises accuracy further, and the resulting procedure requires no training or parameter updates. A sympathetic reader would care because the result indicates that existing MLLMs already encode enough class knowledge to handle few-shot problems once the task is cast in the right form.

Core claim

DeCoDe decomposes few-shot classification into a collection of pairwise binary decisions by prompting an MLLM to judge whether a query image and a support image belong to the same class; the logit of the affirmative token is then used directly as a similarity score to select the most likely class for the query. Supplying high-level domain information in the prompt improves the scores. On twelve datasets the method exceeds current specialized few-shot baselines without any additional training.
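The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `yes_logit` is a hypothetical callable standing in for an MLLM call that returns the affirmative-token logit for one query/support pair, `toy_yes_logit` is a stand-in so the example runs without a model, and taking the maximum over a class's support images is an assumption, since the text here does not say how multiple shots are aggregated.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = Sequence[float]  # toy stand-in for an image (assumption for this sketch)
Scorer = Callable[[Image, Image, str], float]  # hypothetical: returns the "Yes" logit


def decode_classify(
    query: Image,
    support_set: Dict[str, List[Image]],
    yes_logit: Scorer,
    domain: str = "",
) -> Tuple[str, Dict[str, float]]:
    """Score every query/support pair and return the highest-scoring class."""
    class_scores: Dict[str, float] = {}
    for label, supports in support_set.items():
        # One binary "same class?" question per support image of this class.
        pair_scores = [yes_logit(query, s, domain) for s in supports]
        # Assumption: aggregate multiple shots with max; mean is another option.
        class_scores[label] = max(pair_scores)
    predicted = max(class_scores, key=lambda k: class_scores[k])
    return predicted, class_scores


def toy_yes_logit(query: Image, support: Image, domain: str = "") -> float:
    """Stand-in scorer: negative squared distance, higher means more similar."""
    return -sum((q - s) ** 2 for q, s in zip(query, support))


support = {"a": [(0.0, 0.0)], "b": [(1.0, 1.0)]}
label, scores = decode_classify((0.9, 0.8), support, toy_yes_logit, domain="toy")
print(label, scores)
```

In a real setup the `domain` string would be woven into the prompt text; here it is only passed through so the interface matches the description.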

What carries the argument

Decomposition of few-shot classification into binary same-class prompts whose affirmative logits serve as cross-image similarity scores.

If this is right

MLLMs can perform few-shot image classification without any training or fine-tuning.
Including domain context in the prompt measurably raises classification accuracy.
The same decomposition works on both established benchmarks and newly curated datasets from varied domains.
Pairwise comparison is sufficient to surface classification capability already present in off-the-shelf MLLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The result implies that MLLMs already contain implicit representations of visual class similarity that prompting can surface.
Analogous decompositions into binary decisions could be tested on other multimodal tasks such as few-shot detection or retrieval.
If the logit-based scores prove stable, the approach offers a training-free route to adapt MLLMs to new visual domains.

Load-bearing premise

The logit of an MLLM's affirmative response to a pairwise same-class prompt forms a reliable and comparable similarity measure across classes and datasets.

What would settle it

On a held-out dataset the similarity scores produced by the pairwise prompts fail to rank support images by true class membership better than chance or a simple baseline.
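One way to make this test concrete is to compare top-1 accuracy of the pairwise scores against chance. The sketch below is illustrative only: the episode format (a dict of per-class scores plus the true label) and the function names are assumptions for this example, and the Monte Carlo null simply draws the true label uniformly from the candidate classes.

```python
import random
from typing import Dict, List, Tuple

Episode = Tuple[Dict[str, float], str]  # (per-class scores, true label); assumed format


def top1_accuracy(episodes: List[Episode]) -> float:
    """Fraction of episodes where the highest-scoring class is the true class."""
    hits = sum(
        1 for scores, truth in episodes
        if max(scores, key=lambda k: scores[k]) == truth
    )
    return hits / len(episodes)


def chance_accuracy(episodes: List[Episode]) -> float:
    """Expected accuracy of a uniform random guess over each episode's classes."""
    return sum(1.0 / len(scores) for scores, _ in episodes) / len(episodes)


def chance_p_value(episodes: List[Episode], trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of how often random labels match the observed accuracy."""
    rng = random.Random(seed)
    observed = top1_accuracy(episodes)
    at_least = 0
    for _ in range(trials):
        # Null hypothesis: the true label is unrelated to the scores.
        shuffled = [(scores, rng.choice(list(scores))) for scores, _ in episodes]
        if top1_accuracy(shuffled) >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)
```

A small p-value only rules out chance; comparing against a simple baseline scorer would need the same accuracy computed on that baseline's scores.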

Figures

Figures reproduced from arXiv: 2607.00125 by Edson Araujo, Eshika Khandelwal, Hilde Kuehne, Nina Shvetsova, Walid Bousselham, Yunhan Wang.

**Figure 1.** We propose a decomposed prompting technique (DeCoDe) for few-shot classification with MLLMs. We decompose the task into pairwise support–query comparisons, asking whether two images belong to the same class. By ranking the model’s affirmative responses across candidate pairs (compare) and selecting the highest-scoring logit as the predicted class (decide), MLLMs become strong few-shot classifiers withou…

**Figure 2.** Variants of prompt formulations used in our experiments. (a) standard in-context prompting with semantic labels. (b) standard in-context prompting with anonymized labels. (c) decomposed pairwise prompting with semantic labels. (d) decomposed pairwise prompting with anonymized labels. (e) decomposed pairwise prompting with domain information and semantic labels. (f) decomposed pairwise prompting with domain…

**Figure 3.** Scaling with N-way using Qwen3-VL. (a) Accuracy comparison between In-context (with and without SFT) and Decompose + domain info. (b) Corresponding runtime analysis under identical decoding and batching settings. N ∈ {3, 5, 10, 20}.
From Section 4.6 (N-way 1-shot Analysis): We further analyze the scalability with respect to the number of classes N ∈ {3, 5, 10, 20} under the 1-shot setting. In Fig. 3a, In-context inferenc…

**Figure 4.** Example images from the novel datasets (ordered top-down): Lego bricks [15], Industrial parts [36], Yoga [35], Egyptian hieroglyphs [13], Flying insects [32], Arabic sign language [1].

**Figure 5.** Cumulative 5-way 1-shot episodic classification accuracy across four datasets for three MLLMs. The x-axis shows evaluated episodes (up to 1000; 5 episodes per logging step), and the y-axis shows cumulative accuracy. Solid lines correspond to prompting setups, where semantic denotes using semantic labels, anon. denotes removing semantic labels, and dec. denotes decomposed prompting (0 shot, 1 shot semantic,…

**Figure 6.** Logit distribution for decomposed prompting using Qwen2.5-VL on Yoga (left) and Mini-ImageNet (right). For each decomposed inference in each 5-way 1-shot episode, we collect the top-10 predicted tokens over all support–query comparisons. Bars show token frequency; Yes/No are highlighted as the intended answer tokens.
… the top ten logits by their logit score in each episode. Each episode corresponds to five …

**Figure 7.** Failure cases of our DeCoDe method with labels and without Dinfo.
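The cumulative accuracy curve described in the Figure 5 caption (one point every 5 episodes) can be computed as follows. This is a small sketch under assumptions: `correct_flags` is a hypothetical per-episode list of 0/1 outcomes, and the function name is chosen for this example rather than taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def cumulative_accuracy(
    correct_flags: Sequence[int], log_every: int = 5
) -> List[Tuple[int, float]]:
    """Return (episodes_seen, accuracy_so_far) at every logging step."""
    points: List[Tuple[int, float]] = []
    hits = 0
    for i, flag in enumerate(correct_flags, start=1):
        hits += int(flag)
        if i % log_every == 0:
            # Accuracy over all episodes evaluated so far, not just the last step.
            points.append((i, hits / i))
    return points


print(cumulative_accuracy([1, 1, 0, 1, 1, 0, 0, 1, 1, 1]))
```

Because each point averages over every episode seen so far, early points are noisy and the curve flattens as the episode count grows.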

Original abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable abilities when analyzing images, yet translating these capabilities to few-shot image classification remains challenging. To bridge this gap, we present DeCoDe, a simple yet effective technique that enables off-the-shelf MLLMs to act as strong few-shot classifiers without any additional training. Our approach builds on the idea of few-shot classification as a set of pairwise image comparisons, decomposing the task into a set of binary decisions. Given a query image and a support image from a candidate class, the MLLM is prompted to decide whether the two images depict the same class. The logit corresponding to an affirmative response is then used as a similarity score to assign the query image to the most likely class. While this already yields good results, we show that providing additional high-level information, such as the data domain, to the model further improves performance. Our evaluation provides an extensive analysis of various inference variants on a suite of twelve datasets, six established and six newly curated few-shot benchmarks spanning across diverse domains. The results show that the proposed simple decomposition technique can turn off-the-shelf MLLMs into powerful few-shot learners, significantly outperforming current state-of-the-art few-shot methods on both standard and novel domains. Code is available at https://github.com/yunhanwang1105/DeCoDe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DeCoDe turns MLLM pairwise yes-logits into a training-free few-shot classifier and beats some baselines on twelve datasets, but the cross-class comparability of those logits remains the open question.

The main takeaway is that this paper decomposes few-shot classification into many binary "same class?" prompts to an off-the-shelf MLLM and ranks classes by the logit on the affirmative token. They report that this already works reasonably well and improves further when domain information is added to the prompt.

What is actually new is the explicit framing of the task as a collection of pairwise decisions scored directly from the model's next-token distribution instead of from visual embeddings or learned prototypes. The evaluation covers twelve datasets, six of them newly assembled to hit more diverse domains, and the code is released. That breadth is the part that stands up.

The soft spot is exactly the one flagged in the stress test. Raw affirmative logits are treated as monotonic, cross-class similarity scores, yet MLLMs are known to produce context-sensitive and poorly calibrated outputs. If pairs involving common pretraining categories systematically receive higher logits, the argmax can succeed for the wrong reason. The fact that injecting domain text helps is consistent with prompt sensitivity rather than a pure visual comparison. Without ablations on prompt wording, class-frequency effects, or calibration checks, it is hard to know how much of the reported gain comes from the decomposition itself.

This is for people who already work with MLLMs and want a quick prompting route for new classification problems. A reader who needs a training-free baseline would find the numbers useful to check. It deserves peer review because the empirical scope is decent and the method is cheap to reproduce, even though the logit assumption will need direct pushback from referees.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces DeCoDe, a prompting technique that decomposes few-shot image classification into a series of pairwise binary decisions: an MLLM is queried whether a support image and query image belong to the same class, and the logit of the affirmative token is used directly as a similarity score to select the class with the highest score. The base procedure is augmented by optionally injecting high-level domain information into the prompt. Extensive experiments are reported across twelve datasets (six established, six newly curated), with claims of significant outperformance over existing few-shot methods in both standard and novel domains.

Significance. If the central results hold after addressing calibration concerns, the work would demonstrate that off-the-shelf MLLMs can function as strong few-shot classifiers via a simple, training-free decomposition, with broad applicability across domains. The release of code and the introduction of new benchmarks are positive contributions to reproducibility and evaluation standards in multimodal few-shot learning.

major comments (1)

[Abstract] Paragraph describing the scoring procedure: the method assumes the raw affirmative logit constitutes a reliable, monotonic, and cross-class/cross-dataset similarity measure suitable for argmax assignment. No analysis, normalization, or ablation is described that tests whether these logits are comparably scaled or free from class-specific biases (e.g., higher values for frequent pretraining categories). This assumption is load-bearing for the claim that the decomposition itself turns MLLMs into powerful few-shot learners, as unaddressed logit-scale variation could produce the reported gains through bias rather than visual comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the scoring procedure. We address the concern regarding the use of raw affirmative logits below.

Referee: [Abstract] Paragraph describing the scoring procedure: the method assumes the raw affirmative logit constitutes a reliable, monotonic, and cross-class/cross-dataset similarity measure suitable for argmax assignment. No analysis, normalization, or ablation is described that tests whether these logits are comparably scaled or free from class-specific biases (e.g., higher values for frequent pretraining categories). This assumption is load-bearing for the claim that the decomposition itself turns MLLMs into powerful few-shot learners, as unaddressed logit-scale variation could produce the reported gains through bias rather than visual comparison.

Authors: We agree that the abstract does not include an explicit analysis of logit scaling or potential class-specific biases. The full manuscript reports results across twelve datasets with diverse class distributions and domains, where the method outperforms baselines without normalization; this cross-dataset consistency provides indirect evidence that the logits function as effective relative similarity measures for argmax selection. To directly address the concern, we will add a dedicated analysis section with ablations on logit distributions, monotonicity checks, and simple calibration experiments (e.g., per-query normalization) to verify that performance gains derive from visual comparisons rather than pretraining biases.
Revision: yes
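A short sketch helps separate two kinds of normalization that the exchange above touches on. It is illustrative only and not a procedure from the paper: the score matrix is made-up toy data (rows are queries, columns are candidate classes), and `center_per_support` assumes a batch of queries is available, which is an extra assumption beyond the single-query setting. The point it shows is that any monotone per-query transform, such as a softmax across candidates, cannot change the argmax, whereas removing a constant per-class offset can.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows: queries, columns: candidate classes (toy data)


def softmax_per_query(row: List[float]) -> List[float]:
    """Normalize one query's scores across candidates; order is preserved."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def center_per_support(matrix: Matrix) -> Matrix:
    """Subtract each column's mean, removing a constant per-class offset."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[v - col_means[j] for j, v in enumerate(row)] for row in matrix]


def argmax(row: List[float]) -> int:
    return max(range(len(row)), key=lambda j: row[j])


# Toy scores in which class 0 receives systematically higher logits.
raw = [[4.0, 1.5, 0.0], [3.5, 2.0, 0.5], [3.2, 0.4, 2.4]]

print([argmax(r) for r in raw])                          # raw argmax
print([argmax(softmax_per_query(r)) for r in raw])       # unchanged by per-query softmax
print([argmax(r) for r in center_per_support(raw)])      # changes once the offset is removed
```

Under this reading, a per-query normalization experiment alone would not reveal class-specific bias; a per-class baseline of some kind would be needed, and which baseline is appropriate is an open design choice.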

Circularity Check

0 steps flagged

No circularity; method is prompting procedure validated externally

The paper describes a prompting-based decomposition (pairwise 'same class?' queries, affirmative logit as similarity score) evaluated on external benchmarks across twelve datasets. No equations, fitted parameters, or self-citation chains appear in the provided text that reduce the reported performance to inputs defined inside the paper. The central claim rests on empirical comparison to SOTA few-shot methods rather than any internal derivation that loops back by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that current MLLMs can produce usable binary similarity judgments from natural-language prompts; no free parameters are introduced and no new entities are postulated.

axioms (1)

Domain assumption: Multimodal LLMs produce logits for affirmative answers to 'same class?' prompts that can be interpreted as comparable similarity scores across classes.
This assumption is required for the logit-based scoring rule described in the abstract.

Reference graph

Works this paper leans on

70 extracted references · 9 canonical work pages · 8 internal anchors

[1] Al-Barham, M., Alsharkawi, A., Al-Yaman, M., Al-Fetyani, M., Elnagar, A., SaAleek, A.A., Al-Odat, M.: Rgb arabic alphabets sign language dataset. arXiv preprint arXiv:2301.11932 (2023)
[2] Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. In: CVPR (2023)
[3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
[4] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
[5] Baldassini, F.B., Shukor, M., Cord, M., Soulier, L., Piwowarski, B.: What makes multimodal in-context learning work? In: CVPR (2024)
[6] Bendou, Y., Ouasfi, A., Gripon, V., Boukhayma, A.: Proker: A kernel perspective on few-shot adaptation of large vision-language models. In: CVPR (2025)
[7] Chen, G., Yao, W., Song, X., Li, X., Rao, Y., Zhang, K.: Plot: Prompt learning with optimal transport for vision-language models. In: ICLR (2023)
[8] Chi, Z., Gu, L., Liu, H., Wang, Z., Wu, Y., Wang, Y., Plataniotis, K.N.: Learning to adapt frozen CLIP for few-shot test-time domain adaptation. In: ICLR (2025)
[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[10] Farina, M., Mancini, M., Iacca, G., Ricci, E.: Rethinking few-shot adaptation of vision-language models in two stages. In: CVPR (2025)
[11] Fifty, C., Duan, D., Junkins, R.G., Amid, E., Leskovec, J., Re, C., Thrun, S.: Context-aware meta-learning. In: ICLR (2024)
[12] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)
[13] Franken, M., van Gemert, J.C.: Automatic egyptian hieroglyph recognition by retrieving images as texts. In: ACM MM (2013)
[14] Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-adapter: Better vision-language models with feature adapters. IJCV (2024)
[15] Garciam, P.: Lego brick sorting image recognition (2019), kaggle
[16] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: ICLR (2022)
[17] Hu, Z., Wei, Y., Shen, L., Yuan, C., Tao, D.: Unlocking tuning-free few-shot adaptability in visual foundation models by recycling pre-tuned loras. In: CVPR (2025)
[18] Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: Maple: Multi-modal prompt learning. In: CVPR (2023)
[19] Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.F.: Novel dataset for fine-grained image categorization: Stanford dogs
[20] Kravets, A., Chen, D., Namboodiri, V.P.: Rethinking few shot clip benchmarks: A critical analysis in the inductive setting. In: ICCV (2025)
[21] Kukleva, A., Kuehne, H., Schiele, B.: Generalized and incremental few-shot learning by explicit learning and calibration without forgetting. In: ICCV (2021)
[22] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
[23] Li, S., Liu, F., Hao, Z., Wang, X., Li, L., Liu, X., Chen, P., Ma, W.: Logits deconfusion with clip for few-shot learning. In: CVPR (2025)
[24] Li, Y., Li, Y., Vasconcelos, N.: Resound: Towards action recognition without representation bias. In: ECCV (2018)
[25] Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D.: Evaluating text-to-visual generation with image-to-text generation. In: ECCV (2024)
[26] Liu, F., Cai, W., Huo, J., Zhang, C., Chen, D., Zhou, J.: Making large vision language models to be good few-shot learners. In: AAAI (2025)
[27] Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
[28] Mitra, C., Huang, B., Chai, T., Lin, Z., Arbelle, A., Feris, R., Karlinsky, L., Darrell, T., Ramanan, D., Herzig, R.: Enhancing few-shot vision-language classification with large multimodal model features. In: ICCV (2025)
[29] NVIDIA: NVIDIA H100 Tensor Core GPU Architecture (2022), whitepaper
[30] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
[31] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: ICCV (2019)
[32] Piosenka, G.: Butterfly and moths image classification 100 species (2023), kaggle
[33] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
[34] Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: ICLR (2017)
[35] Saxena, S.: Yoga pose image classification dataset (2021), kaggle
[36] Schuerrle, B., Sankarappan, V.: Industrial classification dataset (2023), kaggle
[37] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. In: NeurIPS (2022)
[38] Shvetsova, N., Nagrani, A., Schiele, B., Kuehne, H., Rupprecht, C.: Unbiasing through textual descriptions: Mitigating representation bias in video benchmarks. In: CVPR (2025)
[39] Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)
[40] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
[41] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: CVPR (2018)
[42] Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., Isola, P.: Rethinking few-shot image classification: a good embedding is all you need? In: ECCV (2020)
[43] Udandarao, V., Gupta, A., Albanie, S.: Sus-x: Training-free name-only transfer of vision-language models. In: ICCV (2023)
[44] Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: NeurIPS (2016)
[45] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S., et al.: The caltech-ucsd birds-200-2011 dataset. Tech. rep. (2011)
[46] Yang, C.F., Yin, D., Hu, W., Ji, H., Peng, N., Zhou, B., Chang, K.W.: Verbalized representation learning for interpretable few-shot generalization. In: ICCV (2025)
[47] Yao, H., Zhang, R., Xu, C.: Visual-language prompt tuning with knowledge-guided context optimization. In: CVPR (2023)
[48] Yu, T., Lu, Z., Jin, X., Chen, Z., Wang, X.: Task residual for tuning vision-language models. In: CVPR (2023)
[49] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV (2023)
[50] Zhang, R., Wei, Z., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-adapter: Training-free adaption of clip for few-shot classification. In: ECCV (2022)
[51] Zhao, Z., Wallace, E., Feng, S., Klein, D., Singh, S.: Calibrate before use: Improving few-shot performance of language models. In: ICML (2021)
[52] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: CVPR (2022)
[53] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. IJCV (2022)
[54] Zhu, B., Niu, Y., Han, Y., Wu, Y., Zhang, H.: Prompt-aligned gradient for prompt tuning. In: ICCV (2023)
[55] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)

A Supplementary Materials
A.1 Dataset Details
Dataset Classes T ...
Put the query image first, followed by the support images (query first).
Redefine the few-shot classification problem as an in-context visual matching task (visual match).
Present the support images and query image first, followed by the text description and instruction (images then text).
Standard in-context prompt in Chain of Thought (CoT) style; we use Qwen3-VL-Thinking-8B for this prompt, and set max_token=600.

Table 13: In-context prompt exploration on three novel datasets using Qwen3-VL. Standard in-context denotes the interleaved in-context prompting used in the main paper. We experimented with both the semantic and anonymous settings.

| Labels | Prompt setting | Yoga | Hiero. | Sign | Avg. |
|---|---|---|---|---|---|
| With semantic label | Standard in-context | 74.5 | 82.4 | 68.4 | 75.1 |
| With semantic label | Query first | 76.7 | 77.8 | 56.1 | 70.2 |
| With semantic label | Images then text | 9.0 | 13.8 | 12.5 | 11.8 |
| With semantic label | CoT (Thinking) | 41.7 | 68.4 | 24.1 | 44.7 |
| Anonymous | Standard in-context | 20.3 | 30.0 | 28.1 | 26.1 |
| Anonymous | Query first | 70.5 | 80.5 | 52.5 | 67.8 |
| Anonymous | Visual match | 18.5 | 12.5 | 15.2 | 15.4 |
| Anonymous | Images then text | 5.9 | 20.8 | 8.4 | 11.7 |
| Anonymous | CoT (Thinking) | 9.0 | 47.3 | 2.3 | 19.5 |

Query first prompt: <Image: x^q> What is this? Match it to one of the options below. <Image: x^s_{1,1}> Option 1: c_1. ... <Image: x^s_{5,1}> Option 5: c_5. Which option matches the query image shown first? Choose one of: 1. c_1; ...; 5. c_5

Visual match prompt: <Image: x^s_{1,1}> Image 1. ... <Image: x^s_{5,1}> Image 5. <Image: x^q> Which image (1-5) is most visually similar to the last image? Answer with 1-5 only.

Images then text prompt: <Image: x^s_{1,1}> ... <Image: x^s_{5,1}> <Image: x^q> Image 1 belongs to Option 1: c_1; ...; Image 5 belongs to Option 5: c_5. What class is in the last image? Choose one of the options (1-5)

CoT style prompt (Thinking): <Image: x^s_{1,1}> What is this? c_1 (option 1). ... <Image: x^s_{5,1}> What is this? c_5 (option 5). The following image is the query image. <Image: x^q> So what is this? Choose one of the options: 1. c_1; ...; 5. c_5 Think step by step, then output exactly one final line in this format: Final answer: <number>

In Table 13, under the sem...

[1] [1]

arXiv preprint arXiv:2301.11932 (2023)

Al-Barham, M., Alsharkawi, A., Al-Yaman, M., Al-Fetyani, M., Elnagar, A., SaAleek, A.A., Al-Odat, M.: Rgb arabic alphabets sign language dataset. arXiv preprint arXiv:2301.11932 (2023)

work page arXiv 2023

[2] [2]

In: CVPR (2023)

Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., Le- Cun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. In: CVPR (2023)

2023

[3] [3]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Baldassini, F.B., Shukor, M., Cord, M., Soulier, L., Piwowarski, B.: What makes multimodal in-context learning work? In: CVPR (2024)

2024

[6] [6]

In: CVPR (2025)

Bendou, Y., Ouasfi, A., Gripon, V., Boukhayma, A.: Proker: A kernel perspective on few-shot adaptation of large vision-language models. In: CVPR (2025)

2025

[7] [7]

In: ICLR (2023)

Chen, G., Yao, W., Song, X., Li, X., Rao, Y., Zhang, K.: Plot: Prompt learning with optimal transport for vision-language models. In: ICLR (2023)

2023

[8] [8]

In: ICLR (2025)

Chi, Z., Gu, L., Liu, H., Wang, Z., Wu, Y., Wang, Y., Plataniotis, K.N.: Learning to adapt frozen CLIP for few-shot test-time domain adaptation. In: ICLR (2025)

2025

[9] [9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[10] [10]

In: CVPR (2025)

Farina, M., Mancini, M., Iacca, G., Ricci, E.: Rethinking few-shot adaptation of vision-language models in two stages. In: CVPR (2025)

2025

[11] [11]

In: ICLR (2024)

Fifty, C., Duan, D., Junkins, R.G., Amid, E., Leskovec, J., Re, C., Thrun, S.: Context-aware meta-learning. In: ICLR (2024)

2024

[12] [12]

In: ICML (2017)

Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)

2017

[13] [13]

In: ACM MM (2013)

Franken, M., van Gemert, J.C.: Automatic egyptian hieroglyph recognition by retrieving images as texts. In: ACM MM (2013)

2013

[14] [14]

IJCV (2024)

Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip- adapter: Better vision-language models with feature adapters. IJCV (2024)

2024

[15] [15]

Garciam, P.: Lego brick sorting image recognition (2019), kaggle

2019

[16] [16]

In: ICLR (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: ICLR (2022)

2022

[17] [17]

In: CVPR (2025)

Hu, Z., Wei, Y., Shen, L., Yuan, C., Tao, D.: Unlocking tuning-free few-shot adapt- ability in visual foundation models by recycling pre-tuned loras. In: CVPR (2025)

2025

[18] [18]

In: CVPR (2023)

Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: Maple: Multi-modal prompt learning. In: CVPR (2023)

2023

[19] [19]

Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.F.: Novel dataset for fine-grained image categorization: Stanford dogs

[20] [20]

In: ICCV (2025)

Kravets, A., Chen, D., Namboodiri, V.P.: Rethinking few shot clip benchmarks: A critical analysis in the inductive setting. In: ICCV (2025)

2025

[21] [21]

In: ICCV (2021) DeCoDe: Multimodal LLMs are Implicit Few-Shot Learners 17

Kukleva, A., Kuehne, H., Schiele, B.: Generalized and incremental few-shot learn- ing by explicit learning and calibration without forgetting. In: ICCV (2021) DeCoDe: Multimodal LLMs are Implicit Few-Shot Learners 17

2021

[22] [22]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

2024

[23] [23]

In: CVPR (2025)

Li, S., Liu, F., Hao, Z., Wang, X., Li, L., Liu, X., Chen, P., Ma, W.: Logits deconfusion with clip for few-shot learning. In: CVPR (2025)

2025

[24] [24]

In: ECCV (2018)

Li, Y., Li, Y., Vasconcelos, N.: Resound: Towards action recognition without representation bias. In: ECCV (2018)

2018

[25] [25]

In: ECCV (2024)

Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D.: Evaluating text-to-visual generation with image-to-text generation. In: ECCV (2024)

2024

[26] [26]

In: AAAI (2025)

Liu, F., Cai, W., Huo, J., Zhang, C., Chen, D., Zhou, J.: Making large vision language models to be good few-shot learners. In: AAAI (2025)

2025

[27] [27]

Fine-Grained Visual Classification of Aircraft

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)

2013

[28] [28]

In: ICCV (2025)

Mitra, C., Huang, B., Chai, T., Lin, Z., Arbelle, A., Feris, R., Karlinsky, L., Darrell, T., Ramanan, D., Herzig, R.: Enhancing few-shot vision-language classification with large multimodal model features. In: ICCV (2025)

2025

[29] [29]

NVIDIA: NVIDIA H100 Tensor Core GPU Architecture (2022), whitepaper

2022

[30] [30]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

2023

[31] [31]

In: ICCV (2019)

Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: ICCV (2019)

2019

[32] [32]

Piosenka, G.: Butterfly and moths image classification 100 species (2023), kaggle

2023

[33] [33]

In: ICML (2021)

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)

2021

[34] [34]

In: ICLR (2017)

Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: ICLR (2017)

2017

[35] [35]

Saxena, S.: Yoga pose image classification dataset (2021), kaggle

2021

[36] [36]

Schuerrle, B., Sankarappan, V.: Industrial classification dataset (2023), kaggle

2023

[37] [37]

In: NeurIPS (2022)

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. In: NeurIPS (2022)

2022

[38] [38]

In: CVPR (2025)

Shvetsova, N., Nagrani, A., Schiele, B., Kuehne, H., Rupprecht, C.: Unbiasing through textual descriptions: Mitigating representation bias in video benchmarks. In: CVPR (2025)

2025

[39] [39]

In: NeurIPS (2017)

Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)

2017

[40] [40]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)

2012

[41] [41]

In: CVPR (2018)

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: CVPR (2018)

2018

[42] [42]

Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., Isola, P.: Rethinking few-shot image classification: a good embedding is all you need? In: ECCV (2020)

2020

[43] [43]

In: ICCV (2023)

Udandarao, V., Gupta, A., Albanie, S.: Sus-x: Training-free name-only transfer of vision-language models. In: ICCV (2023)

2023

[44] [44]

In: NeurIPS (2016)

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: NeurIPS (2016)

2016

[45] [45]

Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S., et al.: The caltech-ucsd birds-200-2011 dataset. Tech. rep

2011

[46] [46]

In: ICCV (2025)

Yang, C.F., Yin, D., Hu, W., Ji, H., Peng, N., Zhou, B., Chang, K.W.: Verbalized representation learning for interpretable few-shot generalization. In: ICCV (2025)

2025

[47] [47]

In: CVPR (2023)

Yao, H., Zhang, R., Xu, C.: Visual-language prompt tuning with knowledge-guided context optimization. In: CVPR (2023)

2023

[48] [48]

In: CVPR (2023)

Yu, T., Lu, Z., Jin, X., Chen, Z., Wang, X.: Task residual for tuning vision-language models. In: CVPR (2023)

2023

[49] [49]

In: ICCV (2023)

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV (2023)

2023

[50] [50]

In: ECCV (2022)

Zhang, R., Wei, Z., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-adapter: Training-free adaption of clip for few-shot classification. In: ECCV (2022)

2022

[51] [51]

In: ICML (2021)

Zhao, Z., Wallace, E., Feng, S., Klein, D., Singh, S.: Calibrate before use: Improving few-shot performance of language models. In: ICML (2021)

2021

[52] [52]

In: CVPR (2022)

Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: CVPR (2022)

2022

[53] [53]

IJCV (2022)

Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. IJCV (2022)

2022

[54] [54]

In: ICCV (2023)

Zhu, B., Niu, Y., Han, Y., Wu, Y., Zhang, H.: Prompt-aligned gradient for prompt tuning. In: ICCV (2023)

2023

[55] [55]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)

2025

A Supplementary Materials

A.1 Dataset Details

Dataset Classes T ...

Put the query image first, followed by the support images (query first).

Redefine the few-shot classification problem as an in-context visual matching task (visual match).

Present the support images and query image first, followed by the text description and instruction (images then text).

Standard in-context prompt in Chain of Thought (CoT) style; we use Qwen3-VL-Thinking-8B for this prompt and set max_token=600.

| Prompt Setting | Yoga | Hiero. | Sign | Avg. |
|---|---|---|---|---|
| With semantic label | | | | |
| Standard in-context | 74.5 | 82.4 | 68.4 | 75.1 |
| Query first | 76.7 | 77.8 | 56.1 | 70.2 |
| Images then text | 9.0 | 13.8 | 12.5 | 11.8 |
| CoT (Thinking) | 41.7 | 68.4 | 24.1 | 44.7 |
| Anonymous | | | | |
| Standard in-context | 20.3 | 30.0 | 28.1 | 26.1 |
| Query first | 70.5 | 80.5 | 52.5 | 67.8 |
| Visual match | 18.5 | 12.5 | 15.2 | 15.4 |
| Images then text | 5.9 | 20.8 | 8.4 | 11.7 |
| CoT (Thinking) | 9.0 | 47.3 | 2.3 | 19.5 |

Table 13: In-context prompt exploration on three novel datasets using Qwen3-VL. Standard in-context denotes the interleaved in-context prompting used in the main paper. We experimented with both the semantic and anonymous settings.

Query first prompt: <Image: x^s_{1,1}> What is this? Match it to one of the options below. <Image: x^s_{1,1}> Option 1: c_1. ... <Image: x^s_{5,1}> Option 5: c_5. Which option matches the query image shown first? Choose one of: 1. c_1; ...; 5. c_5
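The query-first template can be assembled programmatically. The following is a minimal sketch, not the authors' code: the function name `build_query_first_prompt` and the segment format (a list of `{"type": ..., ...}` dictionaries with image paths as strings) are assumptions chosen for illustration, and the actual message format depends on the MLLM interface being used.

```python
from typing import Dict, List


def build_query_first_prompt(
    query_image: str,
    support_images: List[str],
    class_names: List[str],
) -> List[Dict[str, str]]:
    """Build an interleaved image/text prompt with the query image first.

    Hypothetical helper: one support image per class is assumed, and the
    segment dictionaries are an illustrative format, not a specific API.
    """
    if len(support_images) != len(class_names):
        raise ValueError("need exactly one support image per class name")

    # The query image and the matching instruction come first.
    segments: List[Dict[str, str]] = [
        {"type": "image", "image": query_image},
        {"type": "text", "text": "What is this? Match it to one of the options below."},
    ]

    # Each support image is followed by its numbered option label.
    for i, (image, name) in enumerate(zip(support_images, class_names), start=1):
        segments.append({"type": "image", "image": image})
        segments.append({"type": "text", "text": f"Option {i}: {name}."})

    # The closing question lists all options again.
    choices = "; ".join(f"{i}. {name}" for i, name in enumerate(class_names, start=1))
    segments.append({
        "type": "text",
        "text": f"Which option matches the query image shown first? Choose one of: {choices}",
    })
    return segments


if __name__ == "__main__":
    for segment in build_query_first_prompt("q.png", ["a.png", "b.png"], ["cat", "dog"]):
        print(segment)
```

The other templates differ only in the order of the image and text segments, so the same pattern applies with the segments rearranged.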

Visual match prompt: <Image: x^s_{1,1}> Image 1. ... <Image: x^s_{5,1}> Image 5. <Image: x^q> Which image (1-5) is most visually similar to the last image? Answer with 1-5 only.

Images then text prompt: <Image: x^s_{1,1}> ... <Image: x^s_{5,1}> <Image: x^q> Image 1 belongs to Option 1: c_1; ...; Image 5 belongs to Option 5: c_5. What class is in the last image? Choose one of the options (1-5)

CoT style prompt (Thinking): <Image: x^s_{1,1}> What is this? c_1 (option 1). ... <Image: x^s_{5,1}> What is this? c_5 (option 5). The following image is the query image. <Image: x^q> So what is this? Choose one of the options: 1. c_1; ...; 5. c_5 Think step by step, then output exactly one final line in this format: Final answer: <number>

In Table 13, under the sem...