Rethinking VLMs and LLMs for Image Classification

Avi Cooper; Chia-Hsien Shih; Hao-Wei Yeh; Hiroaki Yamane; Ian Mason; Jin Yamanaka; Kasper Vinken; Keizo Kato; Kentaro Takemoto; Taro Sunagawa

arxiv: 2410.14690 · v1 · pith:X3I7QJLYnew · submitted 2024-10-03 · 💻 cs.LG · cs.AI · cs.CV

Avi Cooper, Keizo Kato, Chia-Hsien Shih, Hiroaki Yamane, Kasper Vinken, Kentaro Takemoto, Taro Sunagawa, Hao-Wei Yeh, Jin Yamanaka, Ian Mason, Xavier Boix

classification 💻 cs.LG · cs.AI · cs.CV

keywords: llms, visual, vlms, models, accuracy, capabilities, dataset, image

Visual Language Models (VLMs) are increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly improved interactivity and open-ended responsiveness. While these capabilities are remarkable, the contribution of LLMs to the longstanding key problem of classifying an image among a set of choices remains unclear. Through extensive experiments involving seven models, ten visual understanding datasets, and multiple prompt variations per dataset, we find that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. At the same time, leveraging LLMs can improve performance on tasks requiring reasoning and outside knowledge. In response to these challenges, we propose a pragmatic solution: a lightweight fix in which a relatively small LLM efficiently routes each visual task to the most suitable model. The LLM router is trained on a dataset constructed from more than 2.5 million pairs of visual task and model accuracy. Our results show that this lightweight fix matches or surpasses the accuracy of state-of-the-art alternatives, including GPT-4V and HuggingGPT, while improving cost-effectiveness.
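
The routing idea in the abstract can be illustrated with a deliberately simplified sketch. This is not the authors' method: the paper trains a small LLM as the router, whereas the code below swaps in a plain lookup table that picks, for each task type, the model with the highest mean recorded accuracy. The function names (`build_routing_table`, `route`), the task and model labels, and all accuracy numbers are hypothetical placeholders, not values from the paper.

```python
from collections import defaultdict


def build_routing_table(records):
    """Map each task type to the model with the highest mean accuracy.

    records: iterable of (task_type, model_name, accuracy) tuples.
    This lookup table is an assumed stand-in for the trained LLM router.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (task, model) -> [total, count]
    for task, model, acc in records:
        entry = sums[(task, model)]
        entry[0] += acc
        entry[1] += 1

    best = {}  # task -> (model, mean accuracy)
    for (task, model), (total, count) in sums.items():
        mean = total / count
        if task not in best or mean > best[task][1]:
            best[task] = (model, mean)
    return {task: model for task, (model, _) in best.items()}


def route(task_type, table, default_model):
    """Return the model chosen for task_type, or default_model if unseen."""
    return table.get(task_type, default_model)


# Made-up (task, model, accuracy) records, for illustration only.
records = [
    ("object_recognition", "contrastive_vlm", 0.90),
    ("object_recognition", "vlm_with_llm", 0.80),
    ("knowledge_reasoning", "contrastive_vlm", 0.40),
    ("knowledge_reasoning", "vlm_with_llm", 0.70),
]
table = build_routing_table(records)
```

With these placeholder records, `route("object_recognition", table, "vlm_with_llm")` selects the model without an LLM, while a reasoning-heavy task is sent to the LLM-backed model, mirroring the split the abstract reports. A learned router would generalize to unseen task descriptions, which a lookup table cannot do.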

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient PEFT Methods with Adaptive Checkpointing for Vision Models and VLMs on Resource Constrained Consumer-GPUs
cs.CV · 2026-07 · unverdicted · novelty 4.0

Compares PEFT methods (LoRA, QLoRA, BitFit, etc.) plus a new adaptive checkpointing strategy on ViT/Mamba vision models and VLMs, showing 20-30% energy cuts and 43-79% memory reduction at a small accuracy cost on CIFAR-100/DTD.