Revisiting mllms: An in-depth analysis of image classification abilities

Huan Liu, Lingyu Xiao, Jiangjiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang · 2024 · arXiv 2412.16418

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

representative citing papers

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

Specificity-aware reinforcement learning for fine-grained open-world classification

cs.CV · 2026-03-03 · unverdicted · novelty 6.0

SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

cs.CV · 2026-02-07 · unverdicted · novelty 6.0

Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.

Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

HyMOR combines MLLM for coarse open-ended recognition with CLIP for fine-grained domain objects, achieving near-CLIP fine performance and 2.5% better general recognition plus 23.2% overall SBert gain on a new TBO textbook dataset.

citing papers explorer

Showing 4 of 4 citing papers.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 39
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Specificity-aware reinforcement learning for fine-grained open-world classification cs.CV · 2026-03-03 · unverdicted · none · ref 30
SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning cs.CV · 2026-02-07 · unverdicted · none · ref 16
Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained categories with 4-shot training.
Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games cs.CV · 2026-04-18 · unverdicted · none · ref 17
HyMOR combines MLLM for coarse open-ended recognition with CLIP for fine-grained domain objects, achieving near-CLIP fine performance and 2.5% better general recognition plus 23.2% overall SBert gain on a new TBO textbook dataset.

Revisiting mllms: An in-depth analysis of image classification abilities

fields

years

verdicts

representative citing papers

citing papers explorer